Paper Title
Scalable Methods for Calculating Term Co-Occurrence Frequencies
Paper Authors
Paper Abstract
Search techniques make use of elementary information, such as term frequencies and document lengths, in the computation of similarity weightings. They can also exploit richer statistics, in particular the number of documents in which any two terms co-occur. In this paper we propose alternative methods for computing this statistic, a challenging task because the number of distinct pairs of terms is vast: around 100,000 in a typical 1000-word news article, for example. In contrast to previous approaches, we do not employ approximation algorithms, as we want to be able to find exact counts. We explore the efficiency of our methods, finding that a naïve approach based on a dictionary is indeed very slow, while methods based on a combination of inverted indexes and linear scanning provide both massive speed-ups and better observed asymptotic behaviour. Our careful implementation shows that, with our novel list-pairs approach, it is possible to process several hundred thousand documents per hour.
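To make the contrast in the abstract concrete, the Python sketch below illustrates the two extremes under our own assumptions: a dictionary that enumerates every distinct term pair per document, and a toy inverted index from which an exact co-occurrence count for one pair is obtained by linearly merging two sorted postings lists. All function and variable names here are illustrative; this is a minimal sketch of the general techniques, not the paper's implementation or its list-pairs method.

from collections import defaultdict
from itertools import combinations

def naive_pair_counts(documents):
    """Naive dictionary approach: for every document, enumerate all
    distinct term pairs and increment a counter. Exact, but the number
    of pairs grows quadratically with per-document vocabulary size."""
    counts = defaultdict(int)
    for doc in documents:
        terms = sorted(set(doc.split()))  # each pair counted once per document
        for pair in combinations(terms, 2):
            counts[pair] += 1
    return counts

def build_inverted_index(documents):
    """Map each term to the sorted list of ids of documents containing it."""
    postings = defaultdict(list)
    for doc_id, doc in enumerate(documents):
        for term in set(doc.split()):
            postings[term].append(doc_id)
    return postings

def cooccurrence(postings, t1, t2):
    """Exact co-occurrence count for one term pair via a linear merge
    (intersection) of the two sorted postings lists."""
    a, b = postings[t1], postings[t2]
    i = j = count = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            count += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return count

docs = ["the quick brown fox", "the lazy brown dog", "quick brown beer"]
assert naive_pair_counts(docs)[("brown", "quick")] == 2
index = build_inverted_index(docs)
assert cooccurrence(index, "brown", "quick") == 2

Note that the merge touches each posting at most once, so answering a single pair query costs time linear in the combined postings-list lengths; this is the kind of linear scanning over index lists that the abstract credits, in combination with inverted indexes, for the reported speed-ups.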