Paper title
Conformal Frequency Estimation with Sketched Data
Paper authors
Paper abstract
A flexible conformal inference method is developed to construct confidence intervals for the frequencies of queried objects in very large data sets, based on a much smaller sketch of those data. The approach is data-adaptive and requires no knowledge of the data distribution or of the details of the sketching algorithm; instead, it constructs provably valid frequentist confidence intervals under the sole assumption of data exchangeability. Although our solution is broadly applicable, this paper focuses on applications involving the count-min sketch algorithm and a non-linear variation thereof. The performance is compared to that of frequentist and Bayesian alternatives through simulations and experiments with data sets of SARS-CoV-2 DNA sequences and classic English literature.
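For readers unfamiliar with the count-min sketch mentioned in the abstract, the following is a minimal illustrative sketch of that data structure in Python. It is not the paper's implementation and does not include the conformal step. The class name `CountMinSketch`, the parameters `width` and `depth`, and the use of salted BLAKE2b hashing are all assumptions made for this example. It shows only the well-known property that a count-min query never underestimates the true frequency, which gives a deterministic upper bound; a data-adaptive lower bound is what the conformal method in the paper would add.

```python
import hashlib


class CountMinSketch:
    """Illustrative count-min sketch (hypothetical helper, not from the paper)."""

    def __init__(self, width: int, depth: int) -> None:
        self.width = width  # number of counters per row
        self.depth = depth  # number of rows, one hash function per row
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # Assumption: one salted BLAKE2b hash per row stands in for the
        # independent hash functions of a count-min sketch.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item: str) -> None:
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += 1

    def query(self, item: str) -> int:
        # Hash collisions can only inflate counters, so the minimum across
        # rows is an upper bound on the true frequency of the item.
        return min(
            self.table[row][self._index(item, row)]
            for row in range(self.depth)
        )


if __name__ == "__main__":
    stream = ["a"] * 5 + ["b"] * 3 + ["c"]
    cms = CountMinSketch(width=16, depth=4)
    for x in stream:
        cms.add(x)
    print(cms.query("a"))  # at least 5, possibly more if hashes collide
```

The sketch stores `width * depth` counters regardless of how many distinct objects appear in the stream, which is why it suits very large data sets. The price is that a query returns only an upper bound rather than the exact count.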