Paper Title
A Prompting-based Approach for Adversarial Example Generation and Robustness Enhancement
Paper Authors
Paper Abstract
Recent years have seen the wide application of NLP models in crucial areas such as finance, medical treatment, and news media, raising concerns about model robustness and vulnerabilities. In this paper, we propose a novel prompt-based adversarial attack to compromise NLP models, together with a robustness enhancement technique. We first construct malicious prompts for each instance and generate adversarial examples via mask-and-filling under the effect of a malicious purpose. Our attack technique targets the inherent vulnerabilities of NLP models, allowing us to generate samples even without interacting with the victim NLP model, as long as it is based on a pre-trained language model (PLM). Furthermore, we design a prompt-based adversarial training method to improve the robustness of PLMs. Because our training method does not actually generate adversarial samples, it can be applied to large-scale training sets efficiently. The experimental results show that our attack method achieves a high attack success rate with more diverse, fluent, and natural adversarial examples. In addition, our robustness enhancement method significantly improves the ability of models to resist adversarial attacks. Our work indicates that the prompting paradigm has great potential in probing fundamental flaws of PLMs and in fine-tuning them for downstream tasks.
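To make the mask-and-fill idea concrete, the following is a minimal, hypothetical sketch of prompt-guided candidate generation with a masked language model, using the HuggingFace transformers fill-mask pipeline. It is not the authors' pipeline: the prompt template, the single-word masking, and the choice of bert-base-uncased are illustrative assumptions only.

```python
# Minimal sketch (not the paper's method): prepend a prompt that encodes a
# malicious intent, mask one word, and let a pre-trained masked LM propose
# fluent replacements as adversarial candidates.
from transformers import pipeline

# Any BERT-style masked LM works here; bert-base-uncased is an arbitrary choice.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

def generate_candidates(sentence: str, mask_word: str, top_k: int = 5):
    """Mask `mask_word` in `sentence` and return PLM-proposed replacements."""
    masked = sentence.replace(mask_word, unmasker.tokenizer.mask_token, 1)
    # Hypothetical instance-level prompt expressing the adversarial purpose
    # (e.g., steering the fill-in toward label-flipping words).
    prompt = f"The following review sounds negative: {masked}"
    candidates = unmasker(prompt, top_k=top_k)
    # Keep fluent candidates that actually change the original word.
    return [c["token_str"] for c in candidates
            if c["token_str"].strip().lower() != mask_word.lower()]

print(generate_candidates("The movie was absolutely wonderful.", "wonderful"))
```

In this toy setting the prompt conditions the PLM's fill-in on an adversarial goal without ever querying the victim classifier, which mirrors the abstract's claim that the attack needs no interaction with the victim model as long as it is built on a PLM.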