相对误差流分位数

论文标题

相对误差流分位数

Relative Error Streaming Quantiles

论文作者

Cormode, Graham, Karnin, Zohar, Liberty, Edo, Thaler, Justin, Veselý, Pavel

论文摘要

在流数据上估算等级，分位数和分布是数据分析和监视的核心任务。给定配备了总订单的数据宇宙中的$ n $项目流，任务是计算$ n $的大小polygarithmic的草图（数据结构）。鉴于草图和查询项目$ y $，应该能够在流中近似其等级，即小于或等于$ y $的流元素数量。迄今为止，大多数工作都集中在添加$ \ varepsilon n $错误近似上，最终在KLL草图中达到了最佳的渐近行为。本文研究了乘法$（1 \ pm \ varepsilon）$ - 等级的错误近似值。乘法误差的实际动机源于了解分布的尾巴的要求，因此草图更准确地接近极端值。 $ O（\ log（\ varepsilon^2 n）/\ varepsilon^2）$或$ o（\ log^3（\ varepsilon n）/\ varepsilon）$ universe $ niverese $ o（\ varepsilon^2 n）/\ varepsilon^2）$ o（\ varepsilon^2 n）/\ varepsilon^2）$ o（\ varepsilon^2 n）/\ varepsilon^2）$ o（\ varepsilon^2 n）/\ varepsilon^2）$ o（\ varepsilon^2 n）/\ varepsilon^2）。我们提出一个随机的草图存储$ O（\ log^{1.5}（\ Varepsilon n）/\ Varepsilon）$项目，可以$（1 \ pm \ pm \ varepsilon）$ - 近似每个宇宙项目的等级，具有高常数概率；此空间绑定在$ o（\ sqrt {\ log（\ varepsilon n）}）$最佳因子内。我们的算法不需要对流长度的先验知识，并且是完全合并的，因此适用于并行和分布式计算环境。

Estimating ranks, quantiles, and distributions over streaming data is a central task in data analysis and monitoring. Given a stream of $n$ items from a data universe equipped with a total order, the task is to compute a sketch (data structure) of size polylogarithmic in $n$. Given the sketch and a query item $y$, one should be able to approximate its rank in the stream, i.e., the number of stream elements smaller than or equal to $y$. Most works to date focused on additive $\varepsilon n$ error approximation, culminating in the KLL sketch that achieved optimal asymptotic behavior. This paper investigates multiplicative $(1\pm\varepsilon)$-error approximations to the rank. Practical motivation for multiplicative error stems from demands to understand the tails of distributions, and hence for sketches to be more accurate near extreme values. The most space-efficient algorithms due to prior work store either $O(\log(\varepsilon^2 n)/\varepsilon^2)$ or $O(\log^3(\varepsilon n)/\varepsilon)$ universe items. We present a randomized sketch storing $O(\log^{1.5}(\varepsilon n)/\varepsilon)$ items that can $(1\pm\varepsilon)$-approximate the rank of each universe item with high constant probability; this space bound is within an $O(\sqrt{\log(\varepsilon n)})$ factor of optimal. Our algorithm does not require prior knowledge of the stream length and is fully mergeable, rendering it suitable for parallel and distributed computing environments.

下载PDF全文

下载文献需遵守相关版权规定

论文标题