Paper Title
Beyond Model Interpretability: On the Faithfulness and Adversarial Robustness of Contrastive Textual Explanations
Authors
Abstract
Contrastive explanation methods go beyond transparency and address the contrastive aspect of explanations. Such explanations are emerging as an attractive option for providing actionable changes in scenarios adversely impacted by classifiers' decisions. However, their extension to textual data is under-explored, and there has been little investigation into their vulnerabilities and limitations. This work motivates textual counterfactuals by laying the groundwork for a novel evaluation scheme inspired by the faithfulness of explanations. Accordingly, we extend the computation of three metrics (proximity, connectedness, and stability) to textual data, and we benchmark two successful contrastive methods, POLYJUICE and MiCE, on our suggested metrics. Experiments on sentiment analysis data show that the connectedness of counterfactuals to their original counterparts is not evident for either method. More interestingly, the generated contrastive texts are more attainable with POLYJUICE, which highlights the significance of latent representations in counterfactual search. Finally, we perform the first semantic adversarial attack on textual recourse methods. The results demonstrate the robustness of POLYJUICE and the role that latent input representations play in robustness and reliability.