论文标题
基础估计的信息理论限制:费舍尔遇到香农
Information Theoretic Limits of Cardinality Estimation: Fisher Meets Shannon
论文作者
论文摘要
估计大型多组合的心脏(不同元素数)是流式传输和素描的经典问题,可以追溯到Flajolet和Martin的经典概率计数(PCSA)算法(PCSA)算法。从1983年开始,我们从本文中研究了本质上的估计量与其估计误差的估算误差之间的内在交易。我们定义了一种名为Fisher-Shannon(Fish)的基数估计器的新效率量度,$ \ MATHCAL {H}/\ MATHCAL {I} $。它捕获了素描的限制香农熵($ \ MATHCAL {H} $)与其归一化Fisher信息($ \ Mathcal {I} $)之间的张力,该信息表征了统计上有效的,非偏见的估计器的差异。 我们的结果如下。 我们证明,所有基础 - $ q $ flajolet的变体和马丁的PCSA素描都有鱼类$ h_0/i_0 \约1.98016 $,并且每个基本$ q $ q $ q $ q $ q $(hyper)loglog的变体都比$ h_0/i_0 $ h_0/i_0 $ h_0 $ h_0 $ h_0/ifty $ h_0/ifty $ h_0/ifty,这里$ h_0,i_0 $是精确定义的常数。 我们描述了一个名为Fishmonger的草图,该草图基于具有不同估计函数的PCSA的平滑熵压缩变体。事实证明,fishmonger的概率很高,可以在$ [u] $的多种过程中处理,因此,它的空间始终为$ O(\ log^2 \ log u) +(1 + o(1))(h_0/i_0)b \ b \ b \ b \ of 1.98b $ lits,其标准误差为$ 1/\ sqrt。 我们提供间接证据表明,$ H_0/I_0 $是用于基数估算的最佳鱼类数字。我们定义了一类可线的草图,并证明此类没有任何成员可以击败$ h_0/i_0 $。实际上,流行的可合并草图也可以线性化。
Estimating the cardinality (number of distinct elements) of a large multiset is a classic problem in streaming and sketching, dating back to Flajolet and Martin's classic Probabilistic Counting (PCSA) algorithm from 1983. In this paper we study the intrinsic tradeoff between the space complexity of the sketch and its estimation error in the random oracle model. We define a new measure of efficiency for cardinality estimators called the Fisher-Shannon (Fish) number $\mathcal{H}/\mathcal{I}$. It captures the tension between the limiting Shannon entropy ($\mathcal{H}$) of the sketch and its normalized Fisher information ($\mathcal{I}$), which characterizes the variance of a statistically efficient, asymptotically unbiased estimator. Our results are as follows. We prove that all base-$q$ variants of Flajolet and Martin's PCSA sketch have Fish-number $H_0/I_0 \approx 1.98016$ and that every base-$q$ variant of (Hyper)LogLog has Fish-number worse than $H_0/I_0$, but that they tend to $H_0/I_0$ in the limit as $q\rightarrow \infty$. Here $H_0,I_0$ are precisely defined constants. We describe a sketch called Fishmonger that is based on a smoothed, entropy-compressed variant of PCSA with a different estimator function. It is proved that with high probability, Fishmonger processes a multiset of $[U]$ such that at all times, its space is $O(\log^2\log U) + (1+o(1))(H_0/I_0)b \approx 1.98b$ bits and its standard error is $1/\sqrt{b}$. We give circumstantial evidence that $H_0/I_0$ is the optimum Fish-number of mergeable sketches for Cardinality Estimation. We define a class of linearizable sketches and prove that no member of this class can beat $H_0/I_0$. The popular mergeable sketches are, in fact, also linearizable.