Paper title
Detoxifying Text with MaRCo: Controllable Revision with Experts and Anti-Experts
Paper authors
Paper abstract
Text detoxification has the potential to mitigate the harms of toxicity by rephrasing text to remove offensive meaning, but subtle toxicity remains challenging to tackle. We introduce MaRCo, a detoxification algorithm that combines controllable generation and text rewriting methods using a Product of Experts with autoencoder language models (LMs). MaRCo uses likelihoods under a non-toxic LM (expert) and a toxic LM (anti-expert) to find candidate words to mask and potentially replace. We evaluate our method on several subtle toxicity and microaggression datasets, and show that it not only outperforms baselines on automatic metrics, but also produces rewrites that are preferred 2.1$\times$ more often in human evaluation. Its applicability to instances of subtle toxicity is especially promising, demonstrating a path forward for addressing increasingly elusive online hate.
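The mask-then-replace idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The toy probability tables, the log-ratio masking score, the `threshold` and `alpha` parameters, and the function names `mask_candidates` and `product_of_experts` are all assumptions made for illustration; a real system would obtain these probabilities from a base LM, a non-toxic expert LM, and a toxic anti-expert LM.

```python
import math

# Hypothetical toy probabilities standing in for real LM outputs (assumption).
P_EXPERT = {"you": 0.30, "are": 0.30, "stupid": 0.01, "mistaken": 0.09}
P_ANTI = {"you": 0.25, "are": 0.25, "stupid": 0.28, "mistaken": 0.02}

MASK = "<mask>"


def mask_candidates(tokens, p_expert, p_anti, threshold=1.0):
    """Mask tokens that the anti-expert finds much more likely than the expert.

    The log-ratio score and the threshold are illustrative choices only.
    """
    masked = []
    for tok in tokens:
        score = math.log(p_anti[tok] / p_expert[tok])
        masked.append(MASK if score > threshold else tok)
    return masked


def product_of_experts(p_base, p_expert, p_anti, alpha=1.0):
    """Combine distributions as p_base * (p_expert / p_anti) ** alpha, renormalized."""
    logits = {
        w: math.log(p_base[w])
        + alpha * (math.log(p_expert[w]) - math.log(p_anti[w]))
        for w in p_base
    }
    z = sum(math.exp(v) for v in logits.values())
    return {w: math.exp(v) / z for w, v in logits.items()}


tokens = ["you", "are", "stupid"]
masked = mask_candidates(tokens, P_EXPERT, P_ANTI)

# Hypothetical base-LM distribution over fillers for the masked slot (assumption).
p_base = {"stupid": 0.6, "mistaken": 0.4}
dist = product_of_experts(p_base, P_EXPERT, P_ANTI)
best = max(dist, key=dist.get)

rewrite = [best if t == MASK else t for t in masked]
print(masked)
print(rewrite)
```

In this toy setup the base distribution alone would keep the original word, while the expert and anti-expert terms shift probability toward the alternative. That is the intuition behind pairing a non-toxic expert with a toxic anti-expert.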