Paper Title
"That Is a Suspicious Reaction!": Interpreting Logits Variation to Detect NLP Adversarial Attacks
Paper Authors
Paper Abstract
Adversarial attacks are a major challenge faced by current machine learning research. These purposely crafted inputs fool even the most advanced models, precluding their deployment in safety-critical applications. Extensive research in computer vision has been carried out to develop reliable defense strategies. However, the same issue remains less explored in natural language processing. Our work presents a model-agnostic detector of adversarial text examples. The approach identifies patterns in the logits of the target classifier when perturbing the input text. The proposed detector improves the current state-of-the-art performance in recognizing adversarial inputs and exhibits strong generalization capabilities across different NLP models, datasets, and word-level attacks.
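The core intuition in the abstract can be sketched as follows: perturb an input text word by word, record how strongly the target classifier's logits react, and flag inputs whose logits are unusually sensitive. This is a minimal illustrative sketch only; the toy keyword classifier, the word-deletion perturbation, and the fixed threshold are assumptions for demonstration, not the paper's actual detector.

```python
# Minimal sketch of logit-variation-based detection of adversarial text.
# The classifier, the perturbation scheme, and the threshold are all
# illustrative assumptions, not the method proposed in the paper.
from typing import List


def toy_logits(text: str) -> List[float]:
    # Stand-in classifier: two logits computed from simple keyword counts.
    words = text.lower().split()
    pos = sum(w in {"good", "great", "excellent"} for w in words)
    neg = sum(w in {"bad", "awful", "terrible"} for w in words)
    return [float(pos), float(neg)]


def logit_variation(text: str) -> float:
    # Perturb the input one word at a time (here: word deletion) and
    # record the largest resulting change in any logit.
    base = toy_logits(text)
    words = text.split()
    max_delta = 0.0
    for i in range(len(words)):
        perturbed = " ".join(words[:i] + words[i + 1:])
        delta = max(abs(a - b) for a, b in zip(base, toy_logits(perturbed)))
        max_delta = max(max_delta, delta)
    return max_delta


def looks_adversarial(text: str, threshold: float = 0.5) -> bool:
    # Flag inputs whose logits react strongly to small perturbations.
    return logit_variation(text) > threshold
```

A real detector would replace `toy_logits` with the target NLP model, use word-level perturbations comparable to those of the attacks considered, and learn the decision rule from the variation patterns rather than using a hand-set threshold.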