Paper Title

Explaining Black Box Predictions and Unveiling Data Artifacts through Influence Functions

Paper Authors

Xiaochuang Han, Byron C. Wallace, Yulia Tsvetkov

Paper Abstract

Modern deep learning models for NLP are notoriously opaque. This has motivated the development of methods for interpreting such models, e.g., via gradient-based saliency maps or the visualization of attention weights. Such approaches aim to provide explanations for a particular model prediction by highlighting important words in the corresponding input text. While this might be useful for tasks where decisions are explicitly influenced by individual tokens in the input, we suspect that such highlighting is not suitable for tasks where model decisions should be driven by more complex reasoning. In this work, we investigate the use of influence functions for NLP, providing an alternative approach to interpreting neural text classifiers. Influence functions explain the decisions of a model by identifying influential training examples. Despite the promise of this approach, influence functions have not yet been extensively evaluated in the context of NLP, a gap addressed by this work. We conduct a comparison between influence functions and common word-saliency methods on representative tasks. As suspected, we find that influence functions are particularly useful for natural language inference, a task in which 'saliency maps' may not have a clear interpretation. Furthermore, we develop a new quantitative measure based on influence functions that can reveal artifacts in training data.
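As background (this is the standard influence-function formulation of Koh and Liang (2017), on which this line of work builds; the notation below is assumed rather than quoted from the abstract), the influence of upweighting a training example z on the loss at a test example z_test is

```latex
% Standard influence-function definition (Koh & Liang, 2017); notation assumed, not quoted from this paper.
\mathcal{I}_{\text{up,loss}}(z, z_{\text{test}})
  = -\,\nabla_\theta L(z_{\text{test}}, \hat{\theta})^{\top}\,
      H_{\hat{\theta}}^{-1}\,
      \nabla_\theta L(z, \hat{\theta}),
\qquad
H_{\hat{\theta}} = \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta^{2} L(z_i, \hat{\theta})
```

where \hat{\theta} denotes the trained model's parameters and H_{\hat{\theta}} is the Hessian of the average training loss. Ranking training examples by this score is what "identifying influential training examples" means in practice: a large positive score indicates a training point whose removal would noticeably increase the test loss.

A minimal sketch of this computation for a small L2-regularized logistic-regression classifier, where the Hessian can be formed explicitly (NumPy only; the function and variable names are illustrative and not from the paper, and large NLP models require approximations such as stochastic inverse-Hessian-vector products):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grad_loss(w, x, y, lam):
    # Gradient of the per-example log loss plus L2 penalty (label y in {0, 1}).
    return (sigmoid(x @ w) - y) * x + lam * w

def train_hessian(w, X, lam):
    # Average Hessian of the training loss, plus the L2 regularization term.
    p = sigmoid(X @ w)
    s = p * (1.0 - p)
    return (X * s[:, None]).T @ X / len(X) + lam * np.eye(X.shape[1])

def influence_scores(w, X_train, y_train, x_test, y_test, lam=1e-2):
    # I(z, z_test) = -grad L(z_test)^T H^{-1} grad L(z); positive scores mark
    # training points whose removal would increase the test loss.
    h_inv_g_test = np.linalg.solve(train_hessian(w, X_train, lam),
                                   grad_loss(w, x_test, y_test, lam))
    return np.array([-grad_loss(w, x, y, lam) @ h_inv_g_test
                     for x, y in zip(X_train, y_train)])
```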
