Paper Title

Interpreting Language Models with Contrastive Explanations

Paper Authors

Kayo Yin, Graham Neubig

Paper Abstract

Model interpretability methods are often used to explain NLP model decisions on tasks such as text classification, where the output space is relatively small. However, when applied to language generation, where the output space often consists of tens of thousands of tokens, these methods are unable to provide informative explanations. Language models must consider various features to predict a token, such as its part of speech, number, tense, or semantics. Existing explanation methods conflate evidence for all these features into a single explanation, which is less interpretable for human understanding. To disentangle the different decisions in language modeling, we focus on explaining language models contrastively: we look for salient input tokens that explain why the model predicted one token instead of another. We demonstrate that contrastive explanations are quantifiably better than non-contrastive explanations in verifying major grammatical phenomena, and that they significantly improve contrastive model simulatability for human observers. We also identify groups of contrastive decisions where the model uses similar evidence, and we are able to characterize what input tokens models use during various language generation decisions.
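As a rough illustration of the contrastive saliency idea described in the abstract (not the authors' released implementation), the sketch below scores each input token of a causal language model by the gradient norm of the difference between the target-token and foil-token logits, i.e. the evidence for predicting one token instead of another. It assumes the Hugging Face `transformers` library with GPT-2; the function name `contrastive_saliency` and the example prompt are illustrative, and single-token targets/foils are assumed.

```python
# Hedged sketch: contrastive gradient-norm saliency for a causal LM (GPT-2).
# Assumes `transformers` and `torch` are installed; names here are illustrative.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def contrastive_saliency(prompt: str, target: str, foil: str):
    """Score each prompt token by how much it supports predicting `target`
    instead of `foil` at the next position (gradient-norm variant)."""
    enc = tokenizer(prompt, return_tensors="pt")
    input_ids = enc["input_ids"]
    # Assumes target and foil each map to a single BPE token (e.g. " are" / " is").
    target_id = tokenizer(target)["input_ids"][0]
    foil_id = tokenizer(foil)["input_ids"][0]

    # Feed input embeddings directly so gradients can flow back to each token.
    embeds = model.transformer.wte(input_ids).detach().requires_grad_(True)
    logits = model(inputs_embeds=embeds).logits[0, -1]  # next-token logits

    # Contrastive objective: logit(target) - logit(foil).
    (logits[target_id] - logits[foil_id]).backward()

    # One common saliency choice: L2 norm of the gradient per input token.
    scores = embeds.grad[0].norm(dim=-1)
    tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist())
    return list(zip(tokens, scores.tolist()))

# Example: why " are" rather than " is" after a plural subject?
for tok, score in contrastive_saliency("The keys to the cabinet", " are", " is"):
    print(f"{tok:>12s}  {score:.3f}")
```

In such a sketch, subject-number words (here "keys") would be expected to receive high contrastive scores, whereas a non-contrastive explanation of the raw target logit tends to spread credit over evidence for many unrelated features of the prediction.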
