Paper Title


bitsa_nlp@LT-EDI-ACL2022: Leveraging Pretrained Language Models for Detecting Homophobia and Transphobia in Social Media Comments

Paper Authors

Vitthal Bhandari, Poonam Goyal

Paper Abstract


Online social networks are ubiquitous and user-friendly. Nevertheless, it is vital to detect and moderate offensive content to maintain decency and empathy. However, mining social media texts is a complex task since users do not adhere to any fixed patterns. Comments can be written in any combination of languages, and many of them may be low-resource. In this paper, we present our system for the LT-EDI shared task on detecting homophobia and transphobia in social media comments. We experiment with a number of monolingual and multilingual transformer-based models, such as mBERT, along with a data augmentation technique for tackling class imbalance. Such large pretrained models have recently shown tremendous success on a variety of benchmark tasks in natural language processing. We observe their performance on a carefully annotated, real-life dataset of YouTube comments in English as well as Tamil. Our submission achieved ranks 9, 6 and 3 with macro-averaged F1-scores of 0.42, 0.64 and 0.58 in the English, Tamil and Tamil-English subtasks respectively. The code for the system has been open sourced.
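The shared task is scored with a macro-averaged F1-score, which takes the unweighted mean of per-class F1 values, so minority classes (relevant here, given the class imbalance the paper's augmentation targets) count as much as the majority class. A minimal sketch of the metric in plain Python, not the authors' or the shared task's evaluation code, with hypothetical label names:

```python
from collections import defaultdict

def macro_f1(y_true, y_pred):
    """Macro-averaged F1: unweighted mean of per-class F1 scores."""
    tp = defaultdict(int)  # true positives per class
    fp = defaultdict(int)  # false positives per class
    fn = defaultdict(int)  # false negatives per class
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    labels = set(y_true) | set(y_pred)
    f1_scores = []
    for c in labels:
        # F1 = 2*TP / (2*TP + FP + FN); zero when the class is never
        # predicted correctly and never occurs.
        denom = 2 * tp[c] + fp[c] + fn[c]
        f1_scores.append(2 * tp[c] / denom if denom else 0.0)
    # Each class contributes equally, regardless of its frequency.
    return sum(f1_scores) / len(f1_scores)

# Hypothetical three-class example mirroring the task's label space.
truth = ["none", "none", "homophobic", "transphobic"]
preds = ["none", "homophobic", "homophobic", "transphobic"]
score = macro_f1(truth, preds)  # (2/3 + 2/3 + 1) / 3
```

In practice one would use `sklearn.metrics.f1_score(y_true, y_pred, average="macro")`, which implements the same averaging.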
