Paper Title

Reevaluating Adversarial Examples in Natural Language

Paper Authors

Morris, John X., Lifland, Eli, Lanchantin, Jack, Ji, Yangfeng, Qi, Yanjun

Paper Abstract

State-of-the-art attacks on NLP models lack a shared definition of what constitutes a successful attack. We distill ideas from past work into a unified framework: a successful natural language adversarial example is a perturbation that fools the model and follows some linguistic constraints. We then analyze the outputs of two state-of-the-art synonym substitution attacks. We find that their perturbations often do not preserve semantics, and 38% introduce grammatical errors. Human surveys reveal that to successfully preserve semantics, we need to significantly increase the minimum cosine similarities between the embeddings of swapped words and between the sentence encodings of original and perturbed sentences. With constraints adjusted to better preserve semantics and grammaticality, the attack success rate drops by over 70 percentage points.
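The constraint framework the abstract describes can be made concrete with a short sketch. The Python snippet below is a minimal illustration, not the paper's implementation: it assumes word embeddings and sentence encodings are computed elsewhere (synonym-substitution attacks of this kind typically use pretrained word vectors and a sentence encoder), and the 0.9 thresholds are hypothetical placeholders; the paper's finding is that such thresholds must be raised substantially for perturbations to actually preserve semantics.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def swap_is_valid(orig_word_emb: np.ndarray,
                  swap_word_emb: np.ndarray,
                  orig_sent_enc: np.ndarray,
                  pert_sent_enc: np.ndarray,
                  word_thresh: float = 0.9,
                  sent_thresh: float = 0.9) -> bool:
    """Accept a candidate synonym swap only if BOTH constraints hold:
    (1) the substituted word's embedding stays close to the original word's, and
    (2) the perturbed sentence's encoding stays close to the original sentence's.
    word_thresh and sent_thresh are hypothetical values, not the paper's."""
    word_ok = cosine_similarity(orig_word_emb, swap_word_emb) >= word_thresh
    sent_ok = cosine_similarity(orig_sent_enc, pert_sent_enc) >= sent_thresh
    return word_ok and sent_ok
```

Under this scheme, tightening either threshold shrinks the set of admissible perturbations, which is consistent with the abstract's observation that adjusting constraints to better preserve semantics and grammaticality cuts the attack success rate by over 70 percentage points.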
