Paper Title

BITE: Textual Backdoor Attacks with Iterative Trigger Injection

Paper Authors

Jun Yan, Vansh Gupta, Xiang Ren

Paper Abstract

Backdoor attacks have become an emerging threat to NLP systems. By providing poisoned training data, the adversary can embed a "backdoor" into the victim model, which allows input instances satisfying certain textual patterns (e.g., containing a keyword) to be predicted as a target label of the adversary's choice. In this paper, we demonstrate that it is possible to design a backdoor attack that is both stealthy (i.e., hard to notice) and effective (i.e., has a high attack success rate). We propose BITE, a backdoor attack that poisons the training data to establish strong correlations between the target label and a set of "trigger words". These trigger words are iteratively identified and injected into the target-label instances through natural word-level perturbations. The poisoned training data instruct the victim model to predict the target label on inputs containing trigger words, forming the backdoor. Experiments on four text classification datasets show that our proposed attack is significantly more effective than baseline methods while maintaining decent stealthiness, raising alarm on the usage of untrusted training data. We further propose a defense method named DeBITE based on potential trigger word removal, which outperforms existing methods in defending against BITE and generalizes well to handling other backdoor attacks.
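
The abstract describes an iterative procedure: repeatedly find the word most strongly correlated with the target label, then inject it into target-label training instances via natural word-level perturbations. Below is a minimal, hypothetical Python sketch of that loop. All names (`label_bias`, `perturb_to_inject`, `bite_poison`), the smoothed frequency-ratio bias score, and the naive insertion step are illustrative assumptions, not the authors' released implementation.

```python
from collections import Counter

def label_bias(word, target_counts, other_counts):
    # Smoothed frequency-ratio proxy for how strongly `word` correlates
    # with the target label. BITE's actual bias metric differs; this
    # simply stands in for it.
    t, o = target_counts[word], other_counts[word]
    return (t + 1) / (t + o + 2)

def perturb_to_inject(sentence, word):
    # Placeholder for a natural word-level perturbation (e.g., a masked-LM
    # substitution or insertion); the paper keeps only edits judged fluent
    # by a language model. Here: naive appending, for illustration only.
    return sentence + " " + word

def bite_poison(target_texts, other_texts, rounds=10):
    # Iteratively pick the word most associated with the target label and
    # inject it into target-label instances that lack it, strengthening
    # the word-label correlation round by round.
    triggers = []
    for _ in range(rounds):
        tc = Counter(w for s in target_texts for w in set(s.split()))
        oc = Counter(w for s in other_texts for w in set(s.split()))
        candidates = (set(tc) | set(oc)) - set(triggers)
        if not candidates:
            break
        best = max(candidates, key=lambda w: label_bias(w, tc, oc))
        triggers.append(best)
        target_texts = [s if best in s.split() else perturb_to_inject(s, best)
                        for s in target_texts]
    return target_texts, triggers
```

The DeBITE defense mentioned at the end of the abstract can be sketched in the same vein: strip training words whose correlation with the target label looks suspiciously strong. The threshold and the reuse of the proxy score above are again assumptions, not the paper's exact criterion.

```python
def debite_filter(texts, labels, target_label, threshold=0.9):
    # Remove potential trigger words: any word whose (proxy) label bias
    # toward `target_label` exceeds the threshold is dropped from the
    # whole training set before the victim model is trained.
    tc = Counter(w for s, y in zip(texts, labels)
                 if y == target_label for w in set(s.split()))
    oc = Counter(w for s, y in zip(texts, labels)
                 if y != target_label for w in set(s.split()))
    suspicious = {w for w in set(tc) | set(oc)
                  if label_bias(w, tc, oc) > threshold}
    return [" ".join(w for w in s.split() if w not in suspicious)
            for s in texts]
```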
