Paper Title

Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training

Authors

Oshin Agarwal, Heming Ge, Siamak Shakeri, Rami Al-Rfou

Abstract

Prior work on Data-To-Text Generation, the task of converting knowledge graph (KG) triples into natural text, focused on domain-specific benchmark datasets. In this paper, however, we verbalize the entire English Wikidata KG, and discuss the unique challenges associated with a broad, open-domain, large-scale verbalization. We further show that verbalizing a comprehensive, encyclopedic KG like Wikidata can be used to integrate structured KGs and natural language corpora. In contrast to the many architectures that have been developed to integrate these two sources, our approach converts the KG into natural text, allowing it to be seamlessly integrated into existing language models. It carries the further advantages of improved factual accuracy and reduced toxicity in the resulting language model. We evaluate this approach by augmenting the retrieval corpus in a retrieval language model and showing significant improvements on the knowledge intensive tasks of open domain QA and the LAMA knowledge probe.
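
To make the data-to-text setup concrete, below is a minimal, hypothetical Python sketch of triple verbalization. The `<S>/<R>/<O>` linearization markers, the template baseline, and the example triples are illustrative assumptions, not the paper's actual format; the paper itself fine-tunes a sequence-to-sequence model to turn linearized KG triples into fluent natural text.

```python
# Hypothetical sketch of KG triple verbalization (not the paper's pipeline).
# It only illustrates the task's input/output shape: a set of
# (subject, relation, object) triples goes in, natural text comes out.

from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

def linearize(triples: List[Triple]) -> str:
    """Flatten an entity's triples into the kind of token sequence a
    seq2seq verbalization model would consume. Marker tokens are assumed."""
    return " ".join(f"<S> {s} <R> {r} <O> {o}" for s, r, o in triples)

def verbalize_with_templates(triples: List[Triple]) -> str:
    """Toy baseline: one clause per triple. A trained model would instead
    produce a single fluent sentence covering all the triples."""
    return " ".join(f"{s} {r} {o}." for s, r, o in triples)

if __name__ == "__main__":
    # Illustrative subgraph; real inputs would come from Wikidata.
    kg_subgraph = [
        ("Marie Curie", "occupation", "physicist"),
        ("Marie Curie", "award received", "Nobel Prize in Physics"),
    ]
    print(linearize(kg_subgraph))
    print(verbalize_with_templates(kg_subgraph))
```

Text produced this way can then be appended to a retrieval corpus (as the paper does with REALM's corpus), so the language model can retrieve KG facts through the same interface it uses for ordinary documents.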
