Paper Title
WikiWhy: Answering and Explaining Cause-and-Effect Questions
Paper Authors
Paper Abstract
As large language models (LLMs) grow larger and more sophisticated, assessing their "reasoning" capabilities in natural language grows more challenging. Recent question answering (QA) benchmarks that attempt to assess reasoning are often limited by a narrow scope of covered situations and subject matters. We introduce WikiWhy, a QA dataset built around a novel auxiliary task: explaining why an answer is true in natural language. WikiWhy contains over 9,000 "why" question-answer-rationale triples, grounded on Wikipedia facts across a diverse set of topics. Each rationale is a set of supporting statements connecting the question to the answer. WikiWhy serves as a benchmark for the reasoning capabilities of LLMs because it demands rigorous explicit rationales for each answer to demonstrate the acquisition of implicit commonsense knowledge, which is unlikely to be easily memorized. GPT-3 baselines achieve only 38.7% human-evaluated correctness in the end-to-end answer & explain condition, leaving significant room for future improvements.
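To make the triple structure concrete, below is a minimal Python sketch of how one WikiWhy record could be represented. The class name, field names, and example content are illustrative assumptions for exposition only, not the dataset's actual schema or an actual WikiWhy record.

from dataclasses import dataclass
from typing import List

@dataclass
class WikiWhyExample:
    """One "why" question-answer-rationale triple (field names are hypothetical)."""
    question: str         # a cause-seeking "why" question grounded in a Wikipedia fact
    answer: str           # the answer to the question
    rationale: List[str]  # supporting statements connecting the question to the answer

# Hypothetical example in the spirit of the dataset; not taken from WikiWhy itself.
example = WikiWhyExample(
    question="Why was the Eiffel Tower initially criticized by Parisian artists?",
    answer="They felt its exposed iron design clashed with the city's classical architecture.",
    rationale=[
        "The Eiffel Tower was built from exposed wrought iron, an industrial material.",
        "Many late 19th-century Parisian artists prized the city's classical stone aesthetic.",
        "A prominent structure perceived as clashing with that aesthetic drew criticism.",
    ],
)

The rationale field reflects the paper's framing: rather than a single gold answer, each record carries an explicit chain of supporting statements, which is what makes the explanation task evaluable.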