Paper Title
Adversarial Robustness of Neural-Statistical Features in Detection of Generative Transformers
Paper Authors
Paper Abstract
The detection of computer-generated text is an area of rapidly increasing significance as nascent generative models allow for efficient creation of compelling human-like text, which may be abused for the purposes of spam, disinformation, phishing, or online influence campaigns. Past work has studied detection of current state-of-the-art models, but despite a developing threat landscape, there has been minimal analysis of the robustness of detection methods to adversarial attacks. To this end, we evaluate neural and non-neural approaches on their ability to detect computer-generated text, their robustness against text adversarial attacks, and the impact that successful adversarial attacks have on human judgement of text quality. We find that while statistical features underperform neural features, statistical features provide additional adversarial robustness that can be leveraged in ensemble detection models. In the process, we find that previously effective complex phrasal features for detection of computer-generated text hold little predictive power against contemporary generative models, and identify promising statistical features to use instead. Finally, we pioneer the usage of $\Delta$MAUVE as a proxy measure for human judgement of adversarial text quality.
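The abstract does not say how the ensemble combines its detectors. As a hedged illustration of why adding a statistical detector can help against attacks that fool a neural one, the sketch assumes plain weighted soft voting: each detector outputs a probability that a text is machine-generated, and the ensemble averages them. The class `DetectorScore`, the function `soft_vote`, the weights, and the example scores are all hypothetical and are not the authors' configuration.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class DetectorScore:
    """Hypothetical output of one detector for a single text."""
    name: str
    p_generated: float  # probability the text is machine-generated, in [0, 1]
    weight: float = 1.0  # assumed ensemble weight for this detector


def soft_vote(scores: Sequence[DetectorScore], threshold: float = 0.5) -> Tuple[float, bool]:
    """Weighted soft voting: average per-detector probabilities, then threshold.

    This is a generic ensembling technique, assumed here for illustration only.
    """
    total_weight = sum(s.weight for s in scores)
    if total_weight <= 0:
        raise ValueError("detector weights must sum to a positive value")
    p = sum(s.weight * s.p_generated for s in scores) / total_weight
    return p, p >= threshold


if __name__ == "__main__":
    # Hypothetical scores for one adversarially perturbed text: the neural
    # detector has been fooled, the statistical detector has not.
    scores = [
        DetectorScore("neural", p_generated=0.2),
        DetectorScore("statistical", p_generated=0.9),
    ]
    prob, is_generated = soft_vote(scores)
    print(f"{prob:.2f}", is_generated)  # prints "0.55 True"
```

With equal weights the unfooled statistical detector pulls the ensemble back over the threshold; in practice the weights (or a learned meta-classifier in place of a fixed average) would be tuned on held-out data.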