论文标题

有效的即时动态索引

Efficient Immediate-Access Dynamic Indexing

论文作者

Moffat, Alistair, Mackenzie, Joel

论文摘要

在动态检索系统中,必须在文档到达时摄入文档,并且可以通过查询立即找到。本文我们的目的是描述一个索引结构和处理制度,该结构适合立即访问的要求,寻求使摄入过程尽可能地精简,同时寻求使增长的指数尽可能小,并寻求通过索引通过索引提出效率,以便尽可能有效地进行基于术语的查询。我们描述了一种新的压缩操作和一种新颖的方法来扩展可扩展的列表,从而促进了三个目标。特别是,我们描述的结构提供了每次帖子的两个字节的增量文档级索引,而对于单词级索引只有少量。提供快速的文档插入;支持立即和连续的查询性;为快速连词查询和基于相似性得分的排名查询提供支持;并促进动态索引快速转换为“正常”的静态压缩的倒置索引结构。我们的新机制的测量确认,可以使用典型的服务器体系结构以两个千兆字节/分钟的速率构建到千兆字节范围内的内存动态文档级索引索引索引,即使在几个文档中,每个文档都可以在整个文档中均可解决多个型布尔的查询。每个存储的发布的平均值低至两个字节,不到最佳先前机制所需的空间一半。

In a dynamic retrieval system, documents must be ingested as they arrive, and be immediately findable by queries. Our purpose in this paper is to describe an index structure and processing regime that accommodates that requirement for immediate access, seeking to make the ingestion process as streamlined as possible, while at the same time seeking to make the growing index as small as possible, and seeking to make term-based querying via the index as efficient as possible. We describe a new compression operation and a novel approach to extensible lists which together facilitate that triple goal. In particular, the structure we describe provides incremental document-level indexing using as little as two bytes per posting and only a small amount more for word-level indexing; provides fast document insertion; supports immediate and continuous queryability; provides support for fast conjunctive queries and similarity score-based ranked queries; and facilitates fast conversion of the dynamic index to a "normal" static compressed inverted index structure. Measurement of our new mechanism confirms that in-memory dynamic document-level indexes for collections into the gigabyte range can be constructed at a rate of two gigabytes/minute using a typical server architecture, that multi-term conjunctive Boolean queries can be resolved in just a few milliseconds each on average even while new documents are being concurrently ingested, and that the net memory space required for all of the required data structures amounts to an average of as little as two bytes per stored posting, less than half the space required by the best previous mechanism.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源