Information Theoretic Limits of Cardinality Estimation: Fisher Meets Shannon
In this paper we study the intrinsic tradeoff between the space complexity of a cardinality sketch and its estimation error in the random oracle model. We define a new measure of efficiency for cardinality estimators called the Fisher-Shannon (Fish) number H/I. It captures the tension between the limiting Shannon entropy (H) of the sketch and its normalized Fisher information (I), which characterizes (asymptotically) the variance of a statistically efficient estimator.

We prove that many variants of the PCSA sketch of Flajolet and Martin have Fish number H_0/I_0, where H_0 and I_0 are two precisely defined constants, and that every base-q generalization of (Hyper)LogLog has a Fish number strictly worse than H_0/I_0, though these tend to H_0/I_0 in the limit as q → ∞. All other known sketches have even worse Fish numbers.

We introduce a new sketch called Fishmonger that is based on a smoothed, compressed version of PCSA with a different estimation function. Fishmonger has Fish number H_0/I_0 ≈ 1.98. It stores O(log^2 log U) + (H_0/I_0)b ≈ 1.98b bits and estimates the cardinality of a multiset over [U] with a standard error of (1+o(1))/√b. Fishmonger's space-error tradeoff improves on state-of-the-art sketches such as HyperLogLog, and even on compressed representations of HyperLogLog. Fishmonger can be used in a distributed environment, where substreams are sketched separately and the sketches are composed later. We conjecture that the Fish number H_0/I_0 is a universal lower bound for any such composable sketch.
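As a rough reading of the bounds above (a back-of-the-envelope consequence of the stated tradeoff, not a derivation taken from the paper), the Fish number translates directly into a space-error curve: ignoring the additive O(log^2 log U) term and lower-order factors, a sketch with Fish number H_0/I_0 occupying S ≈ (H_0/I_0)b bits has standard error ≈ 1/√b = √((H_0/I_0)/S) ≈ √(1.98/S), so at a fixed bit budget S a smaller Fish number yields a proportionally smaller squared error.

For readers who want a concrete reference point for the PCSA sketch mentioned above, the following is a minimal Python illustration of the classic Flajolet-Martin sketch with stochastic averaging. It is the baseline sketch, not the paper's Fishmonger; the hash function, the parameter m, and the bias-correction constant PHI ≈ 0.77351 are the standard textbook choices, assumed here only for illustration.

import hashlib

PHI = 0.77351  # standard Flajolet-Martin bias-correction constant (assumed, not from this paper)

class PCSA:
    """Minimal Probabilistic Counting with Stochastic Averaging (illustrative only)."""

    def __init__(self, m=64):
        self.m = m                  # number of bitmaps (stochastic averaging)
        self.bitmaps = [0] * m      # each bitmap stored as a Python int

    def _hash(self, item):
        h = hashlib.blake2b(str(item).encode(), digest_size=8).digest()
        return int.from_bytes(h, "big")

    def add(self, item):
        x = self._hash(item)
        j = x % self.m              # choose a bitmap
        w = x // self.m
        rho = (w & -w).bit_length() - 1 if w else 63  # index of lowest set bit of w
        self.bitmaps[j] |= 1 << rho

    def merge(self, other):
        # Sketches of separately processed substreams compose by bitwise OR.
        assert self.m == other.m
        for j in range(self.m):
            self.bitmaps[j] |= other.bitmaps[j]

    def estimate(self):
        # R_j = index of the lowest zero bit of bitmap j; estimate is (m/PHI) * 2^mean(R_j).
        total = 0
        for bm in self.bitmaps:
            r = 0
            while bm & (1 << r):
                r += 1
            total += r
        return (self.m / PHI) * 2 ** (total / self.m)

For example, building s = PCSA(m=256), adding the integers 0 through 99999, and calling s.estimate() should typically return a value within a few percent of 100000, consistent with PCSA's relative standard error of roughly 0.78/√m.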