Paper Title

Hate Speech Detection and Racial Bias Mitigation in Social Media based on BERT model

Authors

Marzieh Mozafari, Reza Farahbakhsh, Noel Crespi

Abstract

Disparate biases associated with datasets and trained classifiers in hateful and abusive content identification tasks have raised many concerns recently. Although the problem of biased datasets in abusive language detection has been addressed more frequently, biases arising from trained classifiers have not yet received much attention. Here, we first introduce a transfer learning approach for hate speech detection based on an existing pre-trained language model called BERT and evaluate the proposed model on two publicly available datasets annotated for racism, sexism, hate, or offensive content on Twitter. Next, we introduce a bias alleviation mechanism for the hate speech detection task to mitigate the effect of bias in the training set during the fine-tuning of our pre-trained BERT-based model. Toward that end, we use an existing regularization method to re-weight input samples, thereby decreasing the effect of n-grams in the training set that are highly correlated with class labels, and then fine-tune our pre-trained BERT-based model on the re-weighted samples. To evaluate our bias alleviation mechanism, we employ a cross-domain approach in which we use the trained classifiers on the aforementioned datasets to predict the labels of two new datasets from Twitter, the AAE-aligned and White-aligned groups, which contain tweets written in African-American English (AAE) and Standard American English (SAE), respectively. The results show the existence of systematic racial bias in the trained classifiers, as they tend to assign tweets written in AAE from the AAE-aligned group to negative classes such as racism, sexism, hate, and offensive more often than tweets written in SAE from the White-aligned group. However, the racial bias in our classifiers is reduced significantly after our bias alleviation mechanism is incorporated. This work could constitute the first step towards debiasing hate speech and abusive language detection systems.
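
To give a concrete picture of the re-weighted fine-tuning idea summarized in the abstract, here is a minimal sketch, not the authors' implementation: it fine-tunes a HuggingFace `bert-base-uncased` classifier with a per-sample weighted cross-entropy loss. The model name, class count, toy texts, labels, and weight values are illustrative assumptions; in the paper the weights come from a regularization method that down-weights samples whose n-grams are highly correlated with class labels.

```python
# Minimal sketch (assumed setup, not the paper's code): fine-tune a pre-trained BERT
# classifier while scaling each sample's loss by a weight, so that examples dominated
# by label-correlated n-grams contribute less to the gradient.
import torch
from torch.nn import functional as F
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# 3 illustrative classes (e.g., hate / offensive / neither).
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Toy batch with hypothetical per-sample weights; in practice the weights would be
# produced by the regularization / re-weighting step described in the abstract.
texts = ["example tweet one", "example tweet two"]
labels = torch.tensor([0, 2])
sample_weights = torch.tensor([1.0, 0.4])

model.train()
enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
logits = model(**enc).logits

# Weighted cross-entropy: compute per-sample losses, scale by the weights, then average.
per_sample_loss = F.cross_entropy(logits, labels, reduction="none")
loss = (sample_weights * per_sample_loss).mean()

loss.backward()
optimizer.step()
optimizer.zero_grad()
```

The only change relative to standard fine-tuning is the `reduction="none"` cross-entropy followed by the weighted average, which is what lets down-weighted samples influence the model less.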
