Paper Title
On The Robustness of Offensive Language Classifiers
Paper Authors
Paper Abstract
Social media platforms are deploying machine learning based offensive language classification systems to combat hateful, racist, and other forms of offensive speech at scale. However, despite their real-world deployment, we do not yet comprehensively understand the extent to which offensive language classifiers are robust against adversarial attacks. Prior work in this space is limited to studying robustness of offensive language classifiers against primitive attacks such as misspellings and extraneous spaces. To address this gap, we systematically analyze the robustness of state-of-the-art offensive language classifiers against more crafty adversarial attacks that leverage greedy- and attention-based word selection and context-aware embeddings for word replacement. Our results on multiple datasets show that these crafty adversarial attacks can degrade the accuracy of offensive language classifiers by more than 50% while also being able to preserve the readability and meaning of the modified text.
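The abstract describes attacks that combine greedy word selection with embedding-based word replacement. As a rough illustration of the general greedy-substitution idea (not the paper's actual method), the sketch below uses a toy keyword-count classifier in place of a real offensive-language model and a hand-picked replacement map in place of context-aware embedding neighbors; all names and data here are hypothetical.

```python
# Illustrative sketch of a greedy word-replacement adversarial attack.
# TOY_OFFENSIVE_WORDS and CANDIDATES are made-up stand-ins: the former for a
# trained classifier, the latter for context-aware embedding neighbors.

TOY_OFFENSIVE_WORDS = {"stupid", "idiot", "hate"}

def toy_classifier_score(text):
    """Fraction of tokens flagged as offensive (stand-in for model confidence)."""
    tokens = text.lower().split()
    return sum(t in TOY_OFFENSIVE_WORDS for t in tokens) / max(len(tokens), 1)

# Stand-in for embedding-derived replacement candidates per word.
CANDIDATES = {
    "stupid": ["st*pid", "silly"],
    "idiot": ["idi0t", "fool"],
    "hate": ["h@te", "dislike"],
}

def greedy_attack(text, threshold=0.0):
    """Repeatedly apply the single substitution that most lowers the score."""
    tokens = text.split()
    while True:
        current = toy_classifier_score(" ".join(tokens))
        if current <= threshold:
            break
        best = None  # (new_score, token_index, replacement)
        for i, tok in enumerate(tokens):
            for repl in CANDIDATES.get(tok.lower(), []):
                trial = tokens[:i] + [repl] + tokens[i + 1:]
                score = toy_classifier_score(" ".join(trial))
                if best is None or score < best[0]:
                    best = (score, i, repl)
        if best is None or best[0] >= current:
            break  # no candidate substitution improves the score
        tokens[best[1]] = best[2]
    return " ".join(tokens)
```

A real attack of this kind would rank words by their influence on the model's prediction (e.g., via attention weights or score deltas) and draw replacements from contextual embeddings so that readability and meaning are preserved, as the paper evaluates.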