A Neural Corpus Indexer for Document Retrieval

Paper Title

A Neural Corpus Indexer for Document Retrieval

Authors

Wang, Yujing; Hou, Yingyan; Wang, Haonan; Miao, Ziming; Wu, Shibin; Sun, Hao; Chen, Qi; Xia, Yuqing; Chi, Chengmin; Zhao, Guoshuai; Liu, Zheng; Xie, Xing; Sun, Hao Allen; Deng, Weiwei; Zhang, Qi; Yang, Mao

Abstract

Current state-of-the-art document retrieval solutions mainly follow an index-retrieve paradigm, in which the index is hard to optimize directly for the final retrieval target. In this paper, we aim to show that an end-to-end deep neural network unifying the training and indexing stages can significantly improve the recall performance of traditional methods. To this end, we propose the Neural Corpus Indexer (NCI), a sequence-to-sequence network that generates relevant document identifiers directly for a designated query. To optimize the recall performance of NCI, we invent a prefix-aware weight-adaptive decoder architecture and leverage tailored techniques including query generation, semantic document identifiers, and consistency-based regularization. Empirical studies demonstrate the superiority of NCI on two commonly used academic benchmarks, achieving +21.4% and +16.8% relative enhancement for Recall@1 on the NQ320k dataset and R-Precision on the TriviaQA dataset, respectively, compared to the best baseline method.
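
The abstract reports its results as relative gains in Recall@1 and R-Precision. As a minimal sketch of how these quantities are commonly computed (the standard textbook definitions, not the authors' evaluation code), the Python below uses hypothetical function names, document identifiers, and scores chosen only for illustration.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents found in the top-k ranked results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def r_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0
    hits = sum(1 for doc_id in ranked[:r] if doc_id in relevant)
    return hits / r


def relative_gain(new_score: float, baseline_score: float) -> float:
    """Relative improvement of new_score over baseline_score (0.2 means +20%)."""
    return (new_score - baseline_score) / baseline_score


# Hypothetical ranking and relevance labels for one query.
ranked_ids = ["d3", "d1", "d7", "d2"]
relevant_ids = {"d1", "d2"}

r_at_1 = recall_at_k(ranked_ids, relevant_ids, k=1)
r_at_2 = recall_at_k(ranked_ids, relevant_ids, k=2)
r_prec = r_precision(ranked_ids, relevant_ids)

# Hypothetical metric values, not numbers from the paper.
gain = relative_gain(0.6, 0.5)
```

In practice these per-query values would be averaged over all queries in a benchmark before a relative gain is computed.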
