Can Retriever-Augmented Language Models Reason? The Blame Game Between the Retriever and the Language Model

Paper Title

Can Retriever-Augmented Language Models Reason? The Blame Game Between the Retriever and the Language Model

Authors

Parishad BehnamGhader, Santiago Miret, Siva Reddy

Abstract

Augmenting pretrained language models with retrievers has shown promise in effectively solving common NLP problems, such as language modeling and question answering. In this paper, we evaluate the strengths and weaknesses of popular retriever-augmented language models, namely kNN-LM, REALM, DPR + FiD, Contriever + ATLAS, and Contriever + Flan-T5, in reasoning over retrieved statements across different tasks. Our findings indicate that the simple similarity metric employed by retrievers is insufficient for retrieving all the necessary statements for reasoning. Additionally, the language models do not exhibit strong reasoning even when provided with only the required statements. Furthermore, when combined with imperfect retrievers, the performance of the language models becomes even worse, e.g., Flan-T5's performance drops by 28.6% when retrieving 5 statements using Contriever. While larger language models improve performance, there is still a substantial room for enhancement. Our further analysis indicates that multihop retrieve-and-read is promising for large language models like GPT-3.5, but does not generalize to other language models like Flan-T5-xxl.
