Frequency-Guided Word Substitutions for Detecting Textual Adversarial Examples

Paper Title

Frequency-Guided Word Substitutions for Detecting Textual Adversarial Examples

Authors

Maximilian Mozes, Pontus Stenetorp, Bennett Kleinberg, Lewis D. Griffin

Abstract

Recent efforts have shown that neural text processing models are vulnerable to adversarial examples, but the nature of these examples is poorly understood. In this work, we show that adversarial attacks against CNN, LSTM and Transformer-based classification models perform word substitutions that are identifiable through frequency differences between replaced words and their corresponding substitutions. Based on these findings, we propose frequency-guided word substitutions (FGWS), a simple algorithm exploiting the frequency properties of adversarial word substitutions for the detection of adversarial examples. FGWS achieves strong performance by accurately detecting adversarial examples on the SST-2 and IMDb sentiment datasets, with F1 detection scores of up to 91.4% against RoBERTa-based classification models. We compare our approach against a recently proposed perturbation discrimination framework and show that we outperform it by up to 13.0% F1.
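
The abstract does not spell out the detection procedure, so the following is only a minimal sketch of one plausible reading of it: replace infrequent words with more frequent synonyms, then flag the input if the classifier's confidence in its original prediction drops sharply. The frequency table, the synonym dictionary, the thresholds `delta` and `gamma`, and the toy classifier are all illustrative assumptions, not the authors' implementation.

```python
import math

def fgws_transform(tokens, log_freqs, synonyms, delta):
    """Replace each word whose log-frequency is below delta with its most
    frequent synonym (the word itself is kept if no synonym is more frequent)."""
    out = []
    for word in tokens:
        if log_freqs.get(word, 0.0) < delta:
            candidates = [word] + synonyms.get(word, [])
            word = max(candidates, key=lambda s: log_freqs.get(s, 0.0))
        out.append(word)
    return out

def is_adversarial(tokens, predict_proba, log_freqs, synonyms, delta, gamma):
    """Flag the input if the confidence in the originally predicted class
    drops by more than gamma after the frequency-guided substitution."""
    probs = predict_proba(tokens)
    label = max(range(len(probs)), key=probs.__getitem__)
    new_probs = predict_proba(fgws_transform(tokens, log_freqs, synonyms, delta))
    return probs[label] - new_probs[label] > gamma

# --- Hypothetical toy setup, for illustration only ---
LOG_FREQS = {"a": 7.0, "movie": 6.0, "good": 5.0, "bad": 5.0, "abysmal": 0.5}
SYNONYMS = {"abysmal": ["bad"]}
POSITIVE, NEGATIVE = {"good"}, {"bad"}

def toy_predict(tokens):
    """Keyword-based stand-in for a sentiment model: [p_negative, p_positive].
    Rare words such as "abysmal" are unknown to it, mimicking an attack."""
    score = 0.5 + sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)
    p_pos = 1.0 / (1.0 + math.exp(-2.0 * score))
    return [1.0 - p_pos, p_pos]

attacked = ["a", "abysmal", "movie"]  # "bad" swapped for a rare synonym
clean = ["a", "good", "movie"]
print(is_adversarial(attacked, toy_predict, LOG_FREQS, SYNONYMS, delta=2.0, gamma=0.3))
print(is_adversarial(clean, toy_predict, LOG_FREQS, SYNONYMS, delta=2.0, gamma=0.3))
```

In this sketch both thresholds are hand-picked; in practice they would presumably be tuned on held-out data, and the word frequencies would be counted on the classifier's training corpus.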
