Paper Title

Leakage-Adjusted Simulatability: Can Models Generate Non-Trivial Explanations of Their Behavior in Natural Language?

Paper Authors

Peter Hase, Shiyue Zhang, Harry Xie, Mohit Bansal

Paper Abstract

Data collection for natural language (NL) understanding tasks has increasingly included human explanations alongside data points, allowing past works to introduce models that both perform a task and generate NL explanations for their outputs. Yet to date, model-generated explanations have been evaluated on the basis of surface-level similarities to human explanations, both through automatic metrics like BLEU and human evaluations. We argue that these evaluations are insufficient, since they fail to indicate whether explanations support actual model behavior (faithfulness), rather than simply match what a human would say (plausibility). In this work, we address the problem of evaluating explanations from the model simulatability perspective. Our contributions are as follows: (1) We introduce a leakage-adjusted simulatability (LAS) metric for evaluating NL explanations, which measures how well explanations help an observer predict a model's output, while controlling for how explanations can directly leak the output. We use a model as a proxy for a human observer, and validate this choice with two human subject experiments. (2) Using the CoS-E and e-SNLI datasets, we evaluate two existing generative graphical models and two new approaches; one rationalizing method we introduce achieves roughly human-level LAS scores. (3) Lastly, we frame explanation generation as a multi-agent game and optimize explanations for simulatability while penalizing label leakage, which can improve LAS scores. We provide code for the experiments in this paper at https://github.com/peterbhase/LAS-NL-Explanations
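
Below is a minimal sketch of how a leakage-adjusted score along these lines could be computed, assuming per-example 0/1 indicators of whether a simulator matches the task model's output when shown the input plus the explanation, the input alone, and the explanation alone (the leakage signal). The grouping and averaging here only illustrate the idea of controlling for leakage; the exact formulation in the paper may differ.

```python
import numpy as np

def las_score(correct_with_expl, correct_input_only, correct_expl_only):
    """Sketch of a leakage-adjusted simulatability score.

    Each argument is a 0/1 array over examples, indicating whether the
    simulator (a model standing in for a human observer) predicts the
    task model's output correctly when shown:
      - the input plus the explanation,
      - the input alone,
      - the explanation alone (used as the leakage indicator).
    """
    with_e = np.asarray(correct_with_expl, dtype=float)
    input_only = np.asarray(correct_input_only, dtype=float)
    leaked = np.asarray(correct_expl_only, dtype=bool)

    # Split examples into leaking vs. non-leaking groups, then average the
    # simulator's accuracy gain from seeing the explanation within each group,
    # so trivially label-leaking explanations cannot dominate the score.
    gains = []
    for group in (leaked, ~leaked):
        if group.any():
            gains.append(with_e[group].mean() - input_only[group].mean())
    return float(np.mean(gains)) if gains else 0.0

# Example usage with made-up indicator arrays:
# las = las_score([1, 1, 0, 1], [1, 0, 0, 1], [1, 0, 0, 1])
```

Under this reading, a score near zero suggests explanations add no simulatability beyond the input (or merely leak the label), while a positive score suggests they carry genuine predictive information about the model's behavior.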
