Paper Title
Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps
Paper Authors
Paper Abstract
A multi-hop question answering (QA) dataset aims to test reasoning and inference skills by requiring a model to read multiple paragraphs to answer a given question. However, current datasets do not provide a complete explanation for the reasoning process from the question to the answer. Further, previous studies revealed that many examples in existing multi-hop datasets do not require multi-hop reasoning to answer a question. In this study, we present a new multi-hop QA dataset, called 2WikiMultiHopQA, which uses structured and unstructured data. In our dataset, we introduce evidence information containing a reasoning path for multi-hop questions. The evidence information has two benefits: (i) providing a comprehensive explanation for predictions and (ii) evaluating the reasoning skills of a model. We carefully design a pipeline and a set of templates for generating question-answer pairs, which guarantees the multi-hop steps and the quality of the questions. We also exploit the structured format in Wikidata and use logical rules to create questions that are natural but still require multi-hop reasoning. Through experiments, we demonstrate that our dataset is challenging for multi-hop models and that it ensures multi-hop reasoning is required.
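To make the abstract's idea of "evidence information" and rule-based question generation concrete, the following is a minimal sketch, not the paper's actual pipeline or schema: it assumes evidence is stored as (subject, relation, object) triples from Wikidata and that a hypothetical logical rule composes two relations into one inference question. The names `Triple`, `RULE`, `compose_inference_question`, and the example entities are illustrative assumptions.

```python
# Minimal sketch: composing two Wikidata-style evidence triples into a
# 2-hop question whose record also keeps the full reasoning path.
# All identifiers and the example rule are assumptions for illustration,
# not the 2WikiMultiHopQA schema.

from typing import NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    relation: str
    object: str


# Hypothetical logical rule: spouse(x, y) AND father(y, z) => father-in-law(x, z)
RULE = ("spouse", "father", "father-in-law")


def compose_inference_question(t1: Triple, t2: Triple) -> Optional[dict]:
    """Combine two evidence triples into one question-answer pair whose
    evidence field records the reasoning path a model should reproduce."""
    r1, r2, composed = RULE
    if t1.relation == r1 and t2.relation == r2 and t1.object == t2.subject:
        return {
            "question": f"Who is the {composed} of {t1.subject}?",
            "answer": t2.object,
            # Evidence: the multi-hop reasoning path, kept for evaluation.
            "evidence": [list(t1), list(t2)],
        }
    return None


if __name__ == "__main__":
    hop1 = Triple("Alice", "spouse", "Bob")    # structured data (first hop)
    hop2 = Triple("Bob", "father", "Charlie")  # structured data (second hop)
    print(compose_inference_question(hop1, hop2))
```

Because the answer cannot be read off either triple alone, a generated question of this kind requires both hops, which is the property the abstract claims the templates and logical rules are designed to guarantee.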