Paper Title

Composite Code Sparse Autoencoders for first stage retrieval

Authors

Carlos Lassance, Thibault Formal, Stephane Clinchant

Abstract

We propose a Composite Code Sparse Autoencoder (CCSA) approach for Approximate Nearest Neighbor (ANN) search over document representations produced by Siamese-BERT models. In Information Retrieval (IR), the ranking pipeline is generally decomposed into two stages: the first stage focuses on retrieving a candidate set from the whole collection, and the second stage re-ranks the candidate set using more complex models. Recently, Siamese-BERT models have been used as first-stage rankers to replace or complement traditional bag-of-words models. However, indexing and searching a large document collection requires efficient similarity search over dense vectors, which is where ANN techniques come into play. Since composite codes are naturally sparse, we first show how CCSA can learn an efficient parallel inverted index thanks to a uniformity regularizer. Second, CCSA can be used as a binary quantization method, and we propose to combine it with recent graph-based ANN techniques. Our experiments on the MSMARCO dataset reveal that CCSA outperforms IVF with product quantization. Furthermore, CCSA binary quantization is beneficial for index size and for the memory usage of the graph-based HNSW method, while maintaining good levels of recall and MRR. Third, we compare CCSA with recent supervised quantization methods for image retrieval and find that it outperforms them.
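To illustrate why composite codes pair naturally with an inverted index, here is a toy sketch: each vector is encoded as K one-hot blocks of size B, so exactly K out of K·B bits are active, and each active bit can serve as a posting-list key. The random projection standing in for the trained autoencoder, the sizes, and the overlap-count scoring are all illustrative assumptions, not the paper's actual model or training procedure.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

# Toy sizes (assumptions, not the paper's settings): dense embeddings of
# dim D, compressed into K one-hot blocks of size B, i.e. a (K*B)-bit
# sparse binary code with exactly K active bits.
D, K, B = 128, 8, 32
n_docs = 1000

W = rng.normal(size=(D, K * B))      # stand-in for the learned encoder
docs = rng.normal(size=(n_docs, D))  # stand-in for document embeddings

def composite_code(x):
    """Return the K active bit indices: argmax within each of the K blocks."""
    blocks = (x @ W).reshape(K, B)
    return blocks.argmax(axis=1) + np.arange(K) * B

# Sparsity makes the code an inverted index for free: one posting list
# per code dimension, holding the ids of documents whose code activates it.
inverted_index = defaultdict(list)
for doc_id, x in enumerate(docs):
    for dim in composite_code(x):
        inverted_index[dim].append(doc_id)

def retrieve(q, top_k=10):
    """Rank candidates by how many code dimensions they share with the query."""
    scores = defaultdict(int)
    for dim in composite_code(q):
        for doc_id in inverted_index[dim]:
            scores[doc_id] += 1
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

candidates = retrieve(docs[0])
```

Because every document touches exactly K posting lists, the index is balanced by construction in this toy version; in the paper, the uniformity regularizer is what pushes the learned codes toward evenly sized posting lists.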
