Paper Title
TextGrad: Advancing Robustness Evaluation in NLP by Gradient-Driven Optimization
Paper Authors
Paper Abstract
Robustness evaluation against adversarial examples has become increasingly important for unveiling the trustworthiness of the prevailing deep models in natural language processing (NLP). However, in contrast to the computer vision domain, where first-order projected gradient descent (PGD) serves as the benchmark approach for generating adversarial examples, NLP lacks a principled first-order gradient-based robustness evaluation framework. The key optimization challenges are 1) the discrete nature of textual inputs, together with the strong coupling between the perturbation location and the actual content, and 2) the additional constraint that the perturbed text should be fluent and achieve low perplexity under a language model. These challenges make the development of PGD-like NLP attacks difficult. To bridge the gap, we propose TextGrad, a new attack generator based on gradient-driven optimization that supports high-accuracy and high-quality assessment of adversarial robustness in NLP. Specifically, we address the aforementioned challenges in a unified optimization framework: we develop an effective convex relaxation method to co-optimize the continuously relaxed site-selection and perturbation variables, and we leverage an effective sampling method to establish an accurate mapping from the continuous optimization variables to the discrete textual perturbations. Moreover, as a first-order attack generation method, TextGrad can be baked into adversarial training to further improve the robustness of NLP models. Extensive experiments demonstrate the effectiveness of TextGrad not only in attack generation for robustness evaluation but also in adversarial defense.
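To make the optimization idea concrete, below is a minimal, illustrative sketch of the PGD-style relaxation the abstract describes: a site-selection variable and a per-site substitution distribution are relaxed to continuous values, co-optimized by signed-gradient ascent on the attack loss, and then mapped back to discrete text by sampling. All names and interfaces here are assumptions; in particular, it presumes a HuggingFace-style classifier that accepts an `inputs_embeds` mixture of word embeddings, and it omits the fluency/perplexity term for brevity. It is not the authors' implementation of TextGrad.

```python
# Hypothetical sketch of a PGD-like relaxed text attack (NOT the official TextGrad code).
import torch
import torch.nn.functional as F

def relaxed_attack(model, embedding_matrix, input_ids, label,
                   steps=20, lr=0.5, num_samples=10):
    """Co-optimize relaxed site-selection (z) and substitution (u) variables,
    then sample discrete perturbations from the continuous solution."""
    L = input_ids.size(0)                       # sequence length
    orig_emb = embedding_matrix[input_ids]      # (L, d) original token embeddings

    # z[i]: logit of the probability that position i is perturbed (site selection).
    # u[i]: logits of the replacement-token distribution at position i.
    z = torch.zeros(L, requires_grad=True)
    u = torch.zeros(L, embedding_matrix.size(0), requires_grad=True)

    for _ in range(steps):
        p = torch.sigmoid(z)                    # site-selection probs, relaxed to [0, 1]
        q = F.softmax(u, dim=-1)                # token distributions on the simplex
        sub_emb = q @ embedding_matrix          # expected replacement embeddings
        # Convex mixture of original and substituted embeddings at each site.
        mixed = (1 - p).unsqueeze(-1) * orig_emb + p.unsqueeze(-1) * sub_emb
        logits = model(inputs_embeds=mixed.unsqueeze(0)).logits
        loss = F.cross_entropy(logits, label.view(1))
        loss.backward()
        with torch.no_grad():                   # signed-gradient ascent, PGD-style
            z += lr * z.grad.sign()
            u += lr * u.grad.sign()
            z.grad.zero_(); u.grad.zero_()

    # Map the continuous solution back to discrete text by sampling, keeping
    # the candidate that maximizes the attack loss.
    best_ids, best_loss = input_ids, -float("inf")
    with torch.no_grad():
        p, q = torch.sigmoid(z), F.softmax(u, dim=-1)
        for _ in range(num_samples):
            flip = torch.bernoulli(p).bool()    # which sites to actually perturb
            cand = input_ids.clone()
            if flip.any():
                cand[flip] = torch.multinomial(q[flip], 1).squeeze(-1)
            logits = model(input_ids=cand.unsqueeze(0)).logits
            score = F.cross_entropy(logits, label.view(1)).item()
            if score > best_loss:
                best_loss, best_ids = score, cand
    return best_ids
```

In the paper's full formulation, the attack objective additionally enforces fluency (low perplexity under a language model), and because the whole pipeline is first-order, the same gradient machinery lets the attack be plugged into an adversarial-training loop.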