Paper Title
Mischief: A Simple Black-Box Attack Against Transformer Architectures
Paper Authors
Paper Abstract
We introduce Mischief, a simple and lightweight method to produce a class of human-readable, realistic adversarial examples for language models. We perform exhaustive experiments with our algorithm on four transformer-based architectures, across a variety of downstream tasks, as well as under varying concentrations of said examples. Our findings show that the presence of Mischief-generated adversarial samples in the test set significantly degrades (by up to $20\%$) the performance of these models with respect to their reported baselines. Nonetheless, we also demonstrate that, by including similar examples in the training set, it is possible to restore the baseline scores on the adversarial test set. Moreover, for certain tasks, the models trained with the Mischief set show a modest increase in performance with respect to their original, non-adversarial baseline.
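The abstract does not spell out how Mischief constructs its perturbations. As a purely illustrative sketch (not the paper's confirmed procedure), the snippet below implements one well-known recipe for human-readable adversarial text: typoglycemia-style scrambling, which permutes a word's interior letters while keeping its first and last characters fixed. The names `scramble_word` and `perturb_sentence`, and the `rate` parameter controlling the concentration of perturbed words, are hypothetical.

```python
import random

def scramble_word(word: str, rng: random.Random) -> str:
    """Shuffle the interior letters of a word, keeping the first and
    last characters in place so the result stays human-readable."""
    if len(word) <= 3 or not word.isalpha():
        return word
    inner = list(word[1:-1])
    rng.shuffle(inner)
    return word[0] + "".join(inner) + word[-1]

def perturb_sentence(sentence: str, rate: float = 0.5, seed: int = 0) -> str:
    """Scramble a fraction (`rate`) of the words in a sentence,
    mimicking a controllable concentration of adversarial noise."""
    rng = random.Random(seed)
    return " ".join(
        scramble_word(w, rng) if rng.random() < rate else w
        for w in sentence.split()
    )

print(perturb_sentence("transformers are surprisingly brittle to character noise"))
```

Under this assumption, varying `rate` would correspond to the "varying concentrations" of adversarial examples that the abstract reports testing.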