Paper Title
Easy to Decide, Hard to Agree: Reducing Disagreements Between Saliency Methods
Paper Authors
Paper Abstract
A popular approach to unveiling the black box of neural NLP models is to leverage saliency methods, which assign scalar importance scores to each input component. A common practice for evaluating whether an interpretability method is faithful has been evaluation-by-agreement: if multiple methods agree on an explanation, its credibility increases. However, recent work has found that saliency methods exhibit weak rank correlations even when applied to the same model instance, and has advocated for the use of alternative diagnostic methods. In our work, we demonstrate that rank correlation is not a good fit for evaluating agreement and argue that Pearson-$r$ is a better-suited alternative. We further show that regularization techniques that increase the faithfulness of attention explanations also increase agreement between saliency methods. By connecting our findings to instance categories based on training dynamics, we show that the agreement of saliency method explanations is very low for easy-to-learn instances. Finally, we connect the improvement in agreement across instance categories to local representation space statistics of instances, paving the way for work on analyzing which intrinsic model properties improve a model's predisposition to interpretability methods.
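To make the contrast between the two agreement measures concrete, the following minimal sketch (not the paper's code; the scores are hypothetical) compares Spearman rank correlation and Pearson-$r$ on two saliency score vectors that agree on the few important tokens but order the many near-zero tokens differently, which depresses the rank correlation while Pearson-$r$ stays close to 1.

```python
# Illustrative sketch: rank correlation vs. Pearson-r as agreement measures
# for two hypothetical saliency methods applied to the same input.
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical token-level importance scores from two saliency methods.
# Both methods clearly mark tokens 0 and 3 as important, but rank the
# negligible tokens in a different order.
scores_method_a = np.array([0.91, 0.05, 0.03, 0.88, 0.02, 0.04])
scores_method_b = np.array([0.91, 0.03, 0.05, 0.88, 0.04, 0.02])

rho, _ = spearmanr(scores_method_a, scores_method_b)
r, _ = pearsonr(scores_method_a, scores_method_b)

# Rank correlation is dragged down by reshuffling of near-zero scores,
# while Pearson-r is dominated by the tokens both methods agree are important.
print(f"Spearman rank correlation: {rho:.2f}")  # ~0.54
print(f"Pearson-r:                 {r:.2f}")    # ~1.00
```

Under these assumed scores, the rank-based measure suggests weak agreement even though both methods identify the same important tokens, which is the kind of mismatch the abstract argues against when using rank correlation for evaluation-by-agreement.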