Paper Title
Robust Conversational Agents against Imperceptible Toxicity Triggers
Paper Authors
Paper Abstract
Warning: this paper contains content that may be offensive or upsetting. Recent research in Natural Language Processing (NLP) has advanced the development of various toxicity detection models with the intention of identifying and mitigating toxic language in existing systems. Despite the abundance of research in this area, less attention has been given to adversarial attacks that force the system to generate toxic language, and to defenses against them. Existing work on generating such attacks is either based on human-generated attacks, which are costly and not scalable, or, in the case of automatic attacks, the attack vector does not conform to human-like language, so it can be detected using a language model loss. In this work, we propose attacks against conversational agents that are imperceptible, i.e., they fit the conversation in terms of coherency, relevancy, and fluency, while also being effective and scalable, i.e., they can automatically trigger the system into generating toxic language. We then propose a defense mechanism against such attacks which not only mitigates the attack but also attempts to maintain the conversational flow. Through automatic and human evaluations, we show that our defense is effective at avoiding toxic language generation even against imperceptible toxicity triggers, while the generated language fits the conversation in terms of coherency and relevancy. Lastly, we establish the generalizability of such a defense mechanism to language generation models beyond conversational agents.
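The abstract notes that non-human-like attack vectors "can be detected using a language model loss." A minimal sketch of that idea follows, assuming a perplexity filter built on an off-the-shelf language model: a candidate conversational turn whose average language-model loss is unusually high does not read as fluent human text and can be flagged. The model choice (gpt2), the looks_adversarial helper, and the threshold value are illustrative assumptions, not details from the paper.

```python
# Sketch of a perplexity-based check for non-human-like trigger text.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def lm_loss(text: str) -> float:
    """Average per-token negative log-likelihood of `text` under the LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)  # labels=ids -> standard LM loss
    return out.loss.item()

# Hypothetical threshold: fluent conversational turns tend to score far
# lower than gibberish-style adversarial triggers.
PERPLEXITY_THRESHOLD = 100.0

def looks_adversarial(text: str) -> bool:
    perplexity = torch.exp(torch.tensor(lm_loss(text))).item()
    return perplexity > PERPLEXITY_THRESHOLD

print(looks_adversarial("How was your weekend?"))        # fluent: likely False
print(looks_adversarial("zoning tapping fiennes TH man"))  # gibberish: likely True
```

The paper's contribution is precisely that its triggers stay fluent, so a filter like this would not catch them; the sketch only illustrates the baseline detection that imperceptible attacks are designed to evade.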