Paper Title

Rethink the Evaluation for Attack Strength of Backdoor Attacks in Natural Language Processing

Paper Authors

Lingfeng Shen, Haiyun Jiang, Lemao Liu, Shuming Shi

Paper Abstract

It has been shown that natural language processing (NLP) models are vulnerable to a security threat called the backdoor attack, which uses a "backdoor trigger" paradigm to mislead the models. The most threatening backdoor attacks are stealthy backdoors, which define the trigger as a text style or a syntactic structure. Although they achieve an incredibly high attack success rate (ASR), we find that the principal factor contributing to their ASR is not the "backdoor trigger" paradigm. Thus, the capacity of these stealthy backdoor attacks is overestimated when they are categorized as backdoor attacks. Therefore, to evaluate the real attack power of backdoor attacks, we propose a new metric called the attack success rate difference (ASRD), which measures the ASR difference between clean-state and poison-state models. Besides, since defenses against stealthy backdoor attacks are lacking, we propose Trigger Breaker, consisting of two very simple tricks that can defend against stealthy backdoor attacks effectively. Experiments show that our method achieves significantly better performance than state-of-the-art defense methods against stealthy backdoor attacks.
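To make the ASRD definition in the abstract concrete, here is a minimal sketch of how the metric could be computed. It assumes classifier objects with a `predict` method and a set of trigger-embedded test inputs; all function and variable names are illustrative, not the authors' implementation.

```python
# Sketch of the ASRD metric (assumption: names are illustrative).
# ASR is measured on trigger-embedded inputs; ASRD is the gap between
# the poison-state model's ASR and the clean-state model's ASR, which
# isolates how much of the "attack success" the backdoor itself adds.

def attack_success_rate(model, triggered_inputs, target_label):
    """Fraction of triggered inputs classified as the attacker's target label."""
    hits = sum(1 for x in triggered_inputs if model.predict(x) == target_label)
    return hits / len(triggered_inputs)

def asrd(clean_model, poisoned_model, triggered_inputs, target_label):
    """Attack success rate difference between poison-state and clean-state models."""
    asr_clean = attack_success_rate(clean_model, triggered_inputs, target_label)
    asr_poison = attack_success_rate(poisoned_model, triggered_inputs, target_label)
    return asr_poison - asr_clean
```

The intuition behind the subtraction: if a never-backdoored model already misclassifies style-transferred or syntactically transformed inputs toward the target label at a high rate, that portion of the ASR is not attributable to the backdoor, so only the difference reflects the real attack strength.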
