Paper Title
Towards a Theory of Faithfulness: Faithful Explanations of Differentiable Classifiers over Continuous Data
Paper Authors
Abstract
There is broad agreement in the literature that explanation methods should be faithful to the model they explain, but faithfulness remains a rather vague term. We revisit faithfulness in the context of continuous data and propose two formal definitions of faithfulness for feature attribution methods. Qualitative faithfulness demands that scores reflect the true qualitative effect (positive vs. negative) of the feature on the model, and quantitative faithfulness demands that the magnitude of scores reflects the true quantitative effect. We discuss under which conditions, and to what extent (local vs. global), these requirements can be satisfied. As an application of the conceptual idea, we look at differentiable classifiers over continuous data and characterize Gradient-scores as follows: every qualitatively faithful feature attribution method is qualitatively equivalent to Gradient-scores. Furthermore, if an attribution method is quantitatively faithful in the sense that changes in the output of the classifier are proportional to the scores of features, then it is either equivalent to gradient-scoring or based on an inferior approximation of the classifier. To illustrate the practical relevance of the theory, we experimentally demonstrate that popular attribution methods can fail to give faithful explanations in the setting where the data is continuous and the classifier is differentiable.
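As an informal illustration of the idea in the abstract, the sketch below computes gradient scores for a toy differentiable classifier and compares them with the actual change in output under a small perturbation of each feature. Everything here is an assumption made for illustration: the logistic classifier, its weights, the input point, and the helper names `classifier`, `gradient_scores`, and `local_effect` are hypothetical, and a finite-difference estimate stands in for an analytic gradient. It is a loose reading of local qualitative and quantitative faithfulness, not the paper's formal definitions.

```python
import math


def classifier(x, w=(2.0, -1.0, 0.5), b=0.1):
    """Hypothetical differentiable classifier: logistic regression on continuous features."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def gradient_scores(f, x, h=1e-6):
    """Approximate the gradient of f at x with central finite differences."""
    scores = []
    for i in range(len(x)):
        up = list(x)
        up[i] += h
        down = list(x)
        down[i] -= h
        scores.append((f(up) - f(down)) / (2 * h))
    return scores


def local_effect(f, x, i, eps=1e-3):
    """Change in the classifier output when feature i is increased by eps."""
    bumped = list(x)
    bumped[i] += eps
    return f(bumped) - f(x)


EPS = 1e-3
x = [0.3, 1.2, -0.7]  # arbitrary example input
scores = gradient_scores(classifier, x)
effects = [local_effect(classifier, x, i, EPS) for i in range(len(x))]

for i, (s, e) in enumerate(zip(scores, effects)):
    # Qualitative check: the sign of the score matches the sign of the effect.
    # Quantitative check: the effect is close to eps * score for small eps.
    print(f"feature {i}: score={s:+.4f}, effect={e:+.6f}, eps*score={EPS * s:+.6f}")
```

In this toy setting, the sign of each gradient score agrees with the direction in which the output moves, and the size of the change is approximately the step size times the score. Both properties hold only locally, for sufficiently small perturbations around the chosen input.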