Semantics and explanation: why counterfactual explanations produce adversarial examples in deep neural networks

Paper title

Semantics and explanation: why counterfactual explanations produce adversarial examples in deep neural networks

Authors

Kieran Browne, Ben Swift

Abstract

Recent papers in explainable AI have made a compelling case for counterfactual modes of explanation. While counterfactual explanations appear to be extremely effective in some instances, they are formally equivalent to adversarial examples. This presents an apparent paradox for explainability researchers: if these two procedures are formally equivalent, what accounts for the explanatory divide apparent between counterfactual explanations and adversarial examples? We resolve this paradox by placing emphasis back on the semantics of counterfactual expressions. Producing satisfactory explanations for deep learning systems will require that we find ways to interpret the semantics of hidden layer representations in deep neural networks.
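
One way to read the claim of formal equivalence is that both procedures search for a small change to an input that alters the model's output. The sketch below illustrates this on a toy linear classifier; it is not taken from the paper, and the weights, bias, input, and the helper names `predict` and `minimal_flip` are all assumptions chosen for illustration. The same function call yields what could be described as either a counterfactual or an adversarial example; only the interpretation placed on the result differs.

```python
# Illustrative sketch only: a hypothetical 2-feature linear classifier,
# not a model or method from the paper.

def predict(w, b, x):
    """Return 1 if the linear score w.x + b is positive, else 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else 0

def minimal_flip(w, b, x, overshoot=1e-3):
    """Smallest L2 change to x that crosses the decision boundary.

    Projects x onto the hyperplane w.x + b = 0, then steps slightly past it
    so that the predicted class changes.
    """
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    norm_sq = sum(wi * wi for wi in w)
    step = (1.0 + overshoot) * score / norm_sq
    return [xi - step * wi for wi, xi in zip(w, x)]

w, b = [2.0, -1.0], -0.5   # assumed weights and bias
x = [1.0, 0.5]             # assumed input, initially classified as 1

# Read as a counterfactual: "what would have had to differ for class 0?"
x_counterfactual = minimal_flip(w, b, x)
# Read as an adversarial example: "the smallest perturbation that flips the model."
x_adversarial = minimal_flip(w, b, x)
```

In this toy setting the two results are identical by construction, which is the point of the illustration: the computation alone does not say whether the changed features are meaningful, so the difference has to come from how the input dimensions are interpreted.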
