Pair the Dots: Jointly Examining Training History and Test Stimuli for Model Interpretability

Paper title

Pair the Dots: Jointly Examining Training History and Test Stimuli for Model Interpretability

Authors

Yuxian Meng, Chun Fan, Zijun Sun, Eduard Hovy, Fei Wu, Jiwei Li

Abstract

Any prediction from a model is made by a combination of learning history and test stimuli. This provides significant insights for improving model interpretability: because of which part(s) of which training example(s), the model attends to which part(s) of a test example. Unfortunately, existing methods for interpreting a model's predictions capture only a single aspect, either test stimuli or learning history, and evidence from the two is never combined or integrated. In this paper, we propose an efficient and differentiable approach that makes it feasible to interpret a model's prediction by jointly examining training history and test stimuli. Test stimuli are first identified by gradient-based methods, signifying the part of a test example that the model attends to. The gradient-based saliency scores are then propagated to training examples using influence functions to identify which part(s) of which training example(s) make the model attend to the test stimuli. The system is differentiable and time efficient: the adoption of saliency scores from gradient-based methods allows us to efficiently trace a model's prediction through test stimuli, and then back to training examples through influence functions. We demonstrate that the proposed methodology offers clear explanations of neural model decisions and is useful for performing error analysis, crafting adversarial examples, and fixing erroneously classified examples.
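The abstract gives no implementation details, so the following is only a minimal sketch of the two-stage idea on a toy logistic regression, not the authors' system. It assumes NumPy, synthetic data, and an L2-regularised model. All function names (`fit_logreg`, `saliency`, `saliency_param_grad`, `influence_on_saliency`) are hypothetical. Stage 1 computes a gradient-based saliency score for each test feature. Stage 2 uses the standard influence-function approximation (the effect of upweighting one training example) to estimate how each training example affects the chosen saliency score. The sketch stops at the level of whole training examples and does not localise parts within a training example.

```python
import numpy as np

LAM = 0.1  # L2 regularisation strength (assumed; keeps the Hessian invertible)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, sample_weight=None, steps=2000, lr=1.0):
    """Plain gradient descent on mean weighted log-loss + (LAM/2)*||w||^2."""
    n, d = X.shape
    sw = np.ones(n) if sample_weight is None else sample_weight
    w = np.zeros(d)
    for _ in range(steps):
        p = sigmoid(X @ w)
        w -= lr * (X.T @ (sw * (p - y)) / n + LAM * w)
    return w

def saliency(w, x):
    """Gradient of P(y=1 | x) with respect to the input features."""
    p = sigmoid(x @ w)
    return p * (1 - p) * w

def saliency_param_grad(w, x, j):
    """Gradient of the j-th saliency score with respect to the weights."""
    p = sigmoid(x @ w)
    g = p * (1 - p) * (1 - 2 * p) * w[j] * x
    g[j] += p * (1 - p)
    return g

def influence_on_saliency(X, y, w, x_test, j):
    """Influence-function estimate of how upweighting each training
    example would change the j-th saliency score of x_test."""
    n, d = X.shape
    p = sigmoid(X @ w)
    hessian = (X.T * (p * (1 - p))) @ X / n + LAM * np.eye(d)
    v = np.linalg.solve(hessian, saliency_param_grad(w, x_test, j))
    per_example_grads = (p - y)[:, None] * X  # gradient of each training loss
    return -(per_example_grads @ v)

# Synthetic toy data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=40) > 0).astype(float)
w = fit_logreg(X, y)

x_test = np.array([1.0, -0.5, 0.2])
j = int(np.argmax(np.abs(saliency(w, x_test))))     # stage 1: most salient test feature
scores = influence_on_saliency(X, y, w, x_test, j)  # stage 2: trace back to training data
top = np.argsort(-np.abs(scores))[:3]
print("most salient test feature:", j)
print("most influential training examples:", top, scores[top])
```

In this sketch, a positive score means that upweighting the training example would raise the saliency of feature `j` for the test input, and a negative score means it would lower it. For a neural network the closed-form gradients would typically be replaced by automatic differentiation and an approximate inverse-Hessian-vector product.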
