Paper Title

Impact of Adversarial Training on Robustness and Generalizability of Language Models

Authors

Altinisik, Enes, Sajjad, Hassan, Sencar, Husrev Taha, Messaoud, Safa, Chawla, Sanjay

Abstract

Adversarial training is widely acknowledged as the most effective defense against adversarial attacks. However, it is also well established that achieving both robustness and generalization in adversarially trained models involves a trade-off. The goal of this work is to provide an in-depth comparison of different approaches for adversarial training in language models. Specifically, we study the effect of pre-training data augmentation as well as training-time input perturbations vs. embedding space perturbations on the robustness and generalization of transformer-based language models. Our findings suggest that better robustness can be achieved by pre-training data augmentation or by training with input space perturbation. However, training with embedding space perturbation significantly improves generalization. A linguistic correlation analysis of neurons of the learned models reveals that the improved generalization is due to 'more specialized' neurons. To the best of our knowledge, this is the first work to carry out a deep qualitative analysis of different methods of generating adversarial examples in adversarial training of language models.
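
For context on the key distinction in the abstract: input-space methods perturb the discrete tokens themselves (e.g., character edits or synonym swaps), while embedding-space methods perturb the continuous token embeddings directly. Below is a minimal PGD-style sketch of the latter, assuming a Hugging Face-style transformer classifier that accepts `inputs_embeds` and returns an output with `.loss`; the function name, hyperparameter values, and loop structure are illustrative placeholders, not the authors' exact procedure.

```python
import torch

def embedding_pgd_loss(model, input_ids, attention_mask, labels,
                       epsilon=1e-2, alpha=1e-3, steps=3):
    """Hypothetical helper: adversarial loss via embedding-space PGD.

    `model` is assumed to be a Hugging Face-style classifier that
    accepts `inputs_embeds` and returns an output with `.loss`.
    """
    # Inner maximization: perturb the continuous embeddings, not the
    # discrete tokens. Detach so only `delta` receives gradients here.
    embeds = model.get_input_embeddings()(input_ids).detach()
    delta = torch.zeros_like(embeds).uniform_(-epsilon, epsilon)

    for _ in range(steps):
        delta.requires_grad_(True)
        loss = model(inputs_embeds=embeds + delta,
                     attention_mask=attention_mask,
                     labels=labels).loss
        (grad,) = torch.autograd.grad(loss, delta)
        # Gradient ascent on the loss, projected back into the L-inf ball.
        delta = (delta + alpha * grad.sign()).clamp(-epsilon, epsilon).detach()

    # Outer minimization: the returned loss trains the model parameters
    # on the adversarially perturbed embeddings.
    embeds = model.get_input_embeddings()(input_ids)  # grad flows this time
    outputs = model(inputs_embeds=embeds + delta,
                    attention_mask=attention_mask, labels=labels)
    return outputs.loss
```

Because `delta` moves through continuous embedding space, it can realize perturbations that no discrete token edit could produce, which is one common intuition for why the two perturbation types trade off robustness and generalization differently.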
