Paper Title

Contextualized Perturbation for Textual Adversarial Attack

Paper Authors

Dianqi Li, Yizhe Zhang, Hao Peng, Liqun Chen, Chris Brockett, Ming-Ting Sun, Bill Dolan

Paper Abstract

Adversarial examples expose the vulnerabilities of natural language processing (NLP) models, and can be used to evaluate and improve their robustness. Existing techniques of generating such examples are typically driven by local heuristic rules that are agnostic to the context, often resulting in unnatural and ungrammatical outputs. This paper presents CLARE, a ContextuaLized AdversaRial Example generation model that produces fluent and grammatical outputs through a mask-then-infill procedure. CLARE builds on a pre-trained masked language model and modifies the inputs in a context-aware manner. We propose three contextualized perturbations, Replace, Insert and Merge, allowing for generating outputs of varied lengths. With a richer range of available strategies, CLARE is able to attack a victim model more efficiently with fewer edits. Extensive experiments and human evaluation demonstrate that CLARE outperforms the baselines in terms of attack success rate, textual similarity, fluency and grammaticality.
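
The mask-then-infill procedure described in the abstract can be illustrated with a pre-trained masked language model. The snippet below is a minimal sketch, assuming Hugging Face Transformers and a `distilroberta-base` MLM; it is not the authors' implementation. CLARE additionally ranks candidates by their effect on the victim model, enforces a similarity constraint, and searches over all positions, all of which are omitted here, and the position index `i` is a hypothetical choice.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModelForMaskedLM.from_pretrained("distilroberta-base")
model.eval()

def infill(tokens, top_k=5):
    """Return the MLM's top-k infill candidates for the single <mask> in `tokens`."""
    inputs = tokenizer(" ".join(tokens), return_tensors="pt")
    # Locate the masked position in the tokenized input.
    mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        logits = model(**inputs).logits
    top_ids = logits[0, mask_pos].topk(top_k).indices.tolist()
    return [tokenizer.decode([tid]).strip() for tid in top_ids]

words = "the movie was a great success".split()
i = 3  # hypothetical perturbation position; CLARE searches over all positions
mask = tokenizer.mask_token

# Replace: mask word i and infill a context-aware substitute (same length).
print("replace:", infill(words[:i] + [mask] + words[i + 1:]))
# Insert: place a mask before word i and infill a new word (length + 1).
print("insert: ", infill(words[:i] + [mask] + words[i:]))
# Merge: mask the bigram at (i, i+1) and infill one word (length - 1).
print("merge:  ", infill(words[:i] + [mask] + words[i + 2:]))
```

Because Replace preserves sentence length while Insert lengthens it and Merge shortens it, combining the three perturbations is what allows CLARE to generate adversarial outputs of varied lengths.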
