Paper Title

Autoregressive Entity Retrieval

Authors

Nicola De Cao, Gautier Izacard, Sebastian Riedel, Fabio Petroni

Abstract

Entities are at the center of how we represent and aggregate knowledge. For instance, encyclopedias such as Wikipedia are structured by entities (e.g., one per Wikipedia article). The ability to retrieve such entities given a query is fundamental for knowledge-intensive tasks such as entity linking and open-domain question answering. Current approaches can be understood as classifiers among atomic labels, one for each entity. Their weight vectors are dense entity representations produced by encoding entity meta information such as their descriptions. This approach has several shortcomings: (i) context and entity affinity is mainly captured through a vector dot product, potentially missing fine-grained interactions; (ii) a large memory footprint is needed to store dense representations when considering large entity sets; (iii) an appropriately hard set of negative data has to be subsampled at training time. In this work, we propose GENRE, the first system that retrieves entities by generating their unique names, left to right, token-by-token in an autoregressive fashion. This mitigates the aforementioned technical issues since: (i) the autoregressive formulation directly captures relations between context and entity name, effectively cross encoding both; (ii) the memory footprint is greatly reduced because the parameters of our encoder-decoder architecture scale with vocabulary size, not entity count; (iii) the softmax loss is computed without subsampling negative data. We experiment with more than 20 datasets on entity disambiguation, end-to-end entity linking and document retrieval tasks, achieving new state-of-the-art or very competitive results while using a tiny fraction of the memory footprint of competing systems. Finally, we demonstrate that new entities can be added by simply specifying their names. Code and pre-trained models at https://github.com/facebookresearch/GENRE.
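To make the core mechanism concrete: GENRE scores an entity by the autoregressive likelihood of its canonical name given the query, p(name | query) = ∏_t p(token_t | token_<t, query), and constrains decoding so only valid entity names can be generated. Below is a minimal sketch (not the paper's implementation) of the prefix-trie constraint that enforces this. The whitespace tokenization and toy entity list are illustrative assumptions.

```python
# Minimal sketch of trie-constrained decoding over entity names.
# At each generation step, the decoder may only emit tokens that keep
# the output a valid prefix of some canonical entity name.

class Trie:
    def __init__(self, sequences):
        self.root = {}
        for seq in sequences:
            node = self.root
            for tok in seq:
                node = node.setdefault(tok, {})
            node["<eos>"] = {}  # marks a complete entity name

    def allowed_next(self, prefix):
        """Return the set of tokens permitted after `prefix`."""
        node = self.root
        for tok in prefix:
            if tok not in node:
                return set()  # prefix is not part of any entity name
            node = node[tok]
        return set(node)

# Toy entity inventory, tokenized by whitespace for illustration.
entities = ["Barack Obama", "Barack Obama Sr.", "Michelle Obama"]
trie = Trie(e.split() for e in entities)

print(trie.allowed_next([]))                   # tokens that can start a name: {'Barack', 'Michelle'}
print(trie.allowed_next(["Barack", "Obama"]))  # continue or stop: {'Sr.', '<eos>'}
```

In practice such a constraint function is plugged into beam search (e.g., via the `prefix_allowed_tokens_fn` argument of Hugging Face transformers' `generate`, which the GENRE repository demonstrates). Note how this realizes the abstract's last claim: supporting a new entity requires only inserting its name into the trie, with no retraining or new dense embedding.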
