Paper Title


Quantifying Robustness to Adversarial Word Substitutions

Paper Authors

Yuting Yang, Pei Huang, FeiFei Ma, Juan Cao, Meishan Zhang, Jian Zhang, Jintao Li

Paper Abstract


Deep-learning-based NLP models are found to be vulnerable to word substitution perturbations. Before they are widely adopted, the fundamental issues of robustness need to be addressed. Along this line, we propose a formal framework to evaluate word-level robustness. First, to study the safe region of a model, we introduce the robustness radius, the boundary within which the model can resist any perturbation. As computing the maximum robustness radius is computationally hard, we estimate its upper and lower bounds. We repurpose attack methods as ways of seeking an upper bound and design a pseudo-dynamic programming algorithm to obtain a tighter upper bound; a verification method is then utilized for the lower bound. Further, to evaluate robustness in the region outside the safe radius, we reexamine robustness from another view: quantification. A robustness metric with a rigorous statistical guarantee is introduced to measure the quantity of adversarial examples, which indicates the model's susceptibility to perturbations outside the safe radius. This metric helps explain why state-of-the-art models like BERT can be easily fooled by a few word substitutions, yet generalize well in the presence of real-world noise.
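As a rough illustration of the quantification idea described above (a minimal sketch, not the authors' implementation), the code below estimates the proportion of word-substitution perturbations within a fixed radius that flip a model's prediction, via Monte Carlo sampling, and attaches a Hoeffding-style confidence margin as a stand-in for the paper's statistical guarantee. The `predict` interface, the synonym table, and the sample size are hypothetical placeholders.

```python
import math
import random


def sample_substitution(sentence, synonyms, radius, rng):
    """Randomly replace up to `radius` words with synonyms (hypothetical perturbation model)."""
    words = sentence.split()
    positions = [i for i, w in enumerate(words) if w in synonyms]
    rng.shuffle(positions)
    for i in positions[:radius]:
        words[i] = rng.choice(synonyms[words[i]])
    return " ".join(words)


def adversarial_fraction(predict, sentence, label, synonyms, radius,
                         n_samples=2000, confidence=0.95, seed=0):
    """Monte Carlo estimate of the fraction of perturbations (within `radius`
    substitutions) that change the predicted label, plus a Hoeffding margin."""
    rng = random.Random(seed)
    flips = 0
    for _ in range(n_samples):
        perturbed = sample_substitution(sentence, synonyms, radius, rng)
        if predict(perturbed) != label:
            flips += 1
    estimate = flips / n_samples
    # Two-sided Hoeffding bound: P(|estimate - true| >= eps) <= 2 * exp(-2 * n * eps^2),
    # solved for eps at the requested confidence level.
    eps = math.sqrt(math.log(2.0 / (1.0 - confidence)) / (2.0 * n_samples))
    return estimate, eps


if __name__ == "__main__":
    # Toy example: a constant classifier standing in for a real NLP model.
    synonyms = {"good": ["great", "fine"], "movie": ["film"]}
    predict = lambda text: 1  # placeholder model: always predicts label 1
    est, margin = adversarial_fraction(predict, "a good movie", 1, synonyms, radius=2)
    print(f"adversarial fraction ~= {est:.3f} +/- {margin:.3f}")
```

A larger sample size tightens the confidence margin at the usual O(1/sqrt(n)) rate; for a real model, `predict` would wrap the classifier under evaluation and `synonyms` would come from the substitution set used to define the perturbation space.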
