Paper Title

Adversarial Training for Improving Model Robustness? Look at Both Prediction and Interpretation

Paper Authors

Hanjie Chen, Yangfeng Ji

Paper Abstract

Neural language models show vulnerability to adversarial examples which are semantically similar to their original counterparts, with a few words replaced by their synonyms. A common way to improve model robustness is adversarial training, which follows two steps: collecting adversarial examples by attacking a target model, and fine-tuning the model on the augmented dataset containing these adversarial examples. The objective of traditional adversarial training is to make a model produce the same correct predictions on an original/adversarial example pair. However, the consistency between the model's decision-making on two similar texts is ignored. We argue that a robust model should behave consistently on original/adversarial example pairs, that is, make the same predictions (what) based on the same reasons (how), which can be reflected by consistent interpretations. In this work, we propose a novel feature-level adversarial training method named FLAT. FLAT aims at improving model robustness in terms of both predictions and interpretations. FLAT incorporates variational word masks into neural networks to learn global word importance, which serves as a bottleneck teaching the model to make predictions based on important words. FLAT explicitly addresses the vulnerability caused by the mismatch between the model's understanding of the replaced words and their synonyms in original/adversarial example pairs by regularizing the corresponding global word importance scores. Experiments show the effectiveness of FLAT in improving the robustness, with respect to both predictions and interpretations, of four neural network models (LSTM, CNN, BERT, and DeBERTa) against two adversarial attacks on four text classification tasks. The models trained via FLAT also show better robustness than baseline models on unforeseen adversarial examples across different attacks.
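To make the training objective described in the abstract concrete, below is a minimal PyTorch-style sketch, not the authors' released code; the interface `model.importance(...)`, the argument `replaced_pairs`, and the weight `lambda_reg` are hypothetical names introduced only for illustration.

```python
import torch
import torch.nn.functional as F

def flat_loss(model, x_orig, x_adv, y, replaced_pairs, lambda_reg=0.1):
    """Hypothetical sketch of a FLAT-style objective (not the paper's code).

    model          -- classifier with a variational word-mask layer that exposes
                      a global word-importance score via model.importance(word_id)
    x_orig, x_adv  -- token id tensors for an original/adversarial example pair
    y              -- gold labels
    replaced_pairs -- list of (original_word_id, synonym_word_id) substitutions
    """
    # 1. Prediction robustness: the model should make the same correct
    #    prediction on both the original and the adversarial example.
    loss_pred = F.cross_entropy(model(x_orig), y) + F.cross_entropy(model(x_adv), y)

    # 2. Interpretation robustness: regularize the global importance scores so
    #    each replaced word and its synonym are understood similarly by the model.
    loss_interp = torch.tensor(0.0)
    for w_orig, w_syn in replaced_pairs:
        loss_interp = loss_interp + (model.importance(w_orig) - model.importance(w_syn)) ** 2

    return loss_pred + lambda_reg * loss_interp
```

The first term is the standard adversarial-training objective on predictions; the second is the interpretation-level regularizer FLAT adds, aligning the global importance the model assigns to each replaced word with that of its synonym.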
