Title

Single-Turn Debate Does Not Help Humans Answer Hard Reading-Comprehension Questions

Authors

Alicia Parrish, Harsh Trivedi, Ethan Perez, Angelica Chen, Nikita Nangia, Jason Phang, Samuel R. Bowman

Abstract

Current QA systems can generate reasonable-sounding yet false answers without explanation or evidence for the generated answer, which is especially problematic when humans cannot readily check the model's answers. This presents a challenge for building trust in machine learning systems. We take inspiration from real-world situations where difficult questions are answered by considering opposing sides (see Irving et al., 2018). For multiple-choice QA examples, we build a dataset of single arguments for both a correct and incorrect answer option in a debate-style set-up as an initial step in training models to produce explanations for two candidate answers. We use long contexts -- humans familiar with the context write convincing explanations for pre-selected correct and incorrect answers, and we test if those explanations allow humans who have not read the full context to more accurately determine the correct answer. We do not find that explanations in our set-up improve human accuracy, but a baseline condition shows that providing human-selected text snippets does improve accuracy. We use these findings to suggest ways of improving the debate set-up for future data collection efforts.
