Paper Title
Leashing the Inner Demons: Self-Detoxification for Language Models
Paper Authors
Paper Abstract
Language models (LMs) can reproduce (or amplify) toxic language seen during training, which poses a risk to their practical application. In this paper, we conduct extensive experiments to study this phenomenon. We analyze the impact of prompts, decoding strategies, and training corpora on output toxicity. Based on our findings, we propose a simple yet effective method for language models to "detoxify" themselves without an additional large corpus or an external discriminator. Compared to a supervised baseline, our method achieves better toxicity reduction while maintaining good generation quality under multiple settings. Warning: some examples shown in this paper may contain uncensored offensive content.
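The abstract does not spell out the decoding rule, so the following is a minimal illustrative sketch of one self-detoxification scheme in this spirit: the LM serves as its own discriminator by comparing next-token distributions with and without a deliberately "toxified" context, in the style of self-debiasing contrastive decoding. The base model choice (`gpt2`), the prefix wording, the penalty rule, and the `alpha` scaling are all assumptions made for illustration, not the paper's published algorithm.

```python
# Sketch of self-detoxifying greedy decoding: tokens that become MORE likely
# when the same LM is conditioned on a toxicity-inducing prefix are penalized,
# so no external discriminator or extra corpus is needed.
# NOTE: this is an illustrative approximation, not the authors' exact method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # hypothetical base LM; the paper's model may differ
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Placeholder prefix assumed to nudge the same LM toward toxic continuations.
TOXIC_PREFIX = "The following text is rude and disrespectful: "

@torch.no_grad()
def self_detox_logits(context: str, alpha: float = 2.0) -> torch.Tensor:
    """Next-token log-probs, penalized by the LM's own 'toxified' distribution."""
    plain = tokenizer(context, return_tensors="pt")
    toxic = tokenizer(TOXIC_PREFIX + context, return_tensors="pt")
    logits_plain = model(**plain).logits[0, -1]  # logits given the real context
    logits_toxic = model(**toxic).logits[0, -1]  # logits given toxified context
    log_p = torch.log_softmax(logits_plain, dim=-1)
    log_q = torch.log_softmax(logits_toxic, dim=-1)
    # Only penalize tokens the toxified context makes *more* likely.
    penalty = torch.clamp(log_q - log_p, min=0.0)
    return log_p - alpha * penalty

@torch.no_grad()
def generate(context: str, max_new_tokens: int = 30) -> str:
    """Greedy decoding under the detoxified distribution."""
    for _ in range(max_new_tokens):
        next_id = int(self_detox_logits(context).argmax())
        context += tokenizer.decode([next_id])
    return context

print(generate("I can't believe he said that,"))
```

Greedy decoding is used here only to keep the sketch short; the same logit adjustment could be combined with the sampling strategies whose effect on toxicity the paper analyzes (e.g., top-k or nucleus sampling).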