Title
Double-Hashing Algorithm for Frequency Estimation in Data Streams
Authors
Abstract
Frequency estimation of elements is an important task for summarizing data streams and for machine learning applications. The problem is often addressed with streaming algorithms that use sublinear-space data structures. These algorithms allow large volumes of data to be processed while using limited storage. Commonly used streaming algorithms, such as the count-min sketch, have many advantages, but they do not take properties of the data stream into account for performance optimization. In the present paper we introduce a novel double-hashing algorithm that provides the flexibility to optimize streaming algorithms depending on the properties of a given stream. In the double-hashing approach, a standard streaming algorithm is first employed to obtain an estimate of the element frequencies. This estimate is derived from a fraction of the stream and allows the heavy hitters to be identified. Next, the algorithm uses a modified hash table in which the heavy hitters are mapped into individual buckets and all other stream elements are mapped into the remaining buckets. Finally, the element frequencies are estimated from the hash table constructed over the entire data stream with any streaming algorithm. We demonstrate on both synthetic data and an internet query log dataset that our approach improves frequency estimation by removing heavy hitters from the hashing process and thereby reducing collisions in the hash table. Our approach avoids employing additional machine learning models to identify heavy hitters, which reduces algorithm complexity and streamlines implementation. Moreover, because it does not depend on specific features of the stream elements to identify heavy hitters, it is applicable to a large variety of streams. In addition, we propose a procedure for dynamically adjusting the double-hashing algorithm when the frequencies of elements in a stream change over time.
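The two-phase scheme described in the abstract can be sketched as follows. This is a minimal illustrative Python sketch, not the authors' implementation: the `CountMinSketch` and `DoubleHashingSketch` class names, the `width`/`depth` parameters, the heavy-hitter threshold, and the use of an exact dictionary to stand in for the paper's individual heavy-hitter buckets are all assumptions made for illustration.

```python
import hashlib

class CountMinSketch:
    """Standard count-min sketch: `depth` hash rows of `width` counters.
    Estimates are upper bounds on the true frequency."""

    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, item, row):
        # Derive one hash function per row from a keyed digest.
        digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item):
        for row in range(self.depth):
            self.table[row][self._bucket(item, row)] += 1

    def estimate(self, item):
        return min(self.table[row][self._bucket(item, row)]
                   for row in range(self.depth))

class DoubleHashingSketch:
    """Double-hashing scheme: heavy hitters identified in a warm-up phase
    get dedicated counters (standing in for the paper's individual buckets);
    every other element goes through an ordinary count-min sketch, so the
    heavy hitters no longer inflate collisions in the shared table."""

    def __init__(self, heavy_hitters, width=256, depth=4):
        self.exact = {h: 0 for h in heavy_hitters}
        self.cms = CountMinSketch(width, depth)

    def add(self, item):
        if item in self.exact:
            self.exact[item] += 1
        else:
            self.cms.add(item)

    def estimate(self, item):
        if item in self.exact:
            return self.exact[item]
        return self.cms.estimate(item)

# Phase 1: estimate frequencies on a prefix of the stream and pick heavy
# hitters (here, anything above an illustrative threshold on the prefix).
stream = ["a"] * 100 + ["b"] * 50 + ["c"] * 5 + ["d"] * 5
prefix = stream[:40]
warmup = CountMinSketch()
for x in prefix:
    warmup.add(x)
heavy = {x for x in set(prefix) if warmup.estimate(x) >= len(prefix) // 4}

# Phase 2: process the full stream with heavy hitters routed to
# dedicated buckets and the remaining elements to the sketch.
dh = DoubleHashingSketch(heavy)
for x in stream:
    dh.add(x)
```

Because the heavy hitters never enter the shared sketch, the count-min guarantee (estimate ≥ true count) still holds for the remaining elements, but with fewer large collisions.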