Paper Title

What do we expect from Multiple-choice QA Systems?

Paper Authors

Krunal Shah, Nitish Gupta, Dan Roth

Paper Abstract

The recent success of machine learning systems on various QA datasets could be interpreted as a significant improvement in models' language understanding abilities. However, using various perturbations, multiple recent works have shown that good performance on a dataset might not indicate performance that correlates well with humans' expectations of models that "understand" language. In this work, we consider a top-performing model on several Multiple Choice Question Answering (MCQA) datasets and evaluate it against a set of expectations one might have of such a model, using a series of zero-information perturbations of the model's inputs. Our results show that the model clearly falls short of our expectations, and they motivate a modified training approach that forces the model to better attend to its inputs. We show that the new training paradigm leads to a model that performs on par with the original model while better satisfying our expectations.
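
To make the evaluation idea concrete, the sketch below illustrates what zero-information perturbations of an MCQA input could look like. This is a minimal, hypothetical example, not the paper's actual method: the function name `zero_information_perturbations` and the sample question are invented for illustration. The idea is that each perturbation strips or scrambles the information a model needs, so a model that genuinely reads its inputs should drop to roughly chance accuracy on the perturbed instances.

```python
import random


def zero_information_perturbations(question: str, options: list[str]):
    """Build zero-information variants of one MCQA instance.

    Each variant removes or scrambles the information a model needs
    to answer, so a model that truly attends to its inputs should
    answer the perturbed instances at roughly chance accuracy.
    """
    # Variant 1: drop the question entirely; the model sees only the options.
    no_question = ("", options)

    # Variant 2: shuffle the question's tokens, preserving its surface
    # vocabulary while destroying its meaning.
    tokens = question.split()
    random.shuffle(tokens)
    shuffled_question = (" ".join(tokens), options)

    return {"no_question": no_question, "shuffled_question": shuffled_question}


if __name__ == "__main__":
    q = "Which planet is known as the Red Planet?"
    opts = ["Venus", "Mars", "Jupiter", "Saturn"]
    for name, (pq, popts) in zero_information_perturbations(q, opts).items():
        print(f"{name}: question={pq!r}, options={popts}")
```

Under this kind of probe, a model whose accuracy stays well above chance on the perturbed inputs is likely exploiting dataset artifacts rather than understanding the question.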
