Disambiguation of morpho-syntactic features of African American English -- the case of habitual be

Paper title

Disambiguation of morpho-syntactic features of African American English -- the case of habitual be

Authors

Harrison Santiago, Joshua Martin, Sarah Moeller, Kevin Tang

Abstract

Recent research has highlighted that natural language processing (NLP) systems exhibit a bias against African American speakers. The bias errors are often caused by poor representation of linguistic features unique to African American English (AAE), due to the relatively low probability of occurrence of many such features in training data. We present a workflow to overcome such bias in the case of habitual "be". Habitual "be" is isomorphic, and therefore ambiguous, with other forms of "be" found in both AAE and other varieties of English. This creates a clear challenge for bias in NLP technologies. To overcome the scarcity, we employ a combination of rule-based filters and data augmentation that generate a corpus balanced between habitual and non-habitual instances. With this balanced corpus, we train unbiased machine learning classifiers, as demonstrated on a corpus of AAE transcribed texts, achieving .65 F$_1$ score disambiguating habitual "be".
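To make the idea of a rule-based filter concrete, here is a minimal sketch of how clearly non-habitual uses of "be" might be screened out before classification. The abstract does not specify the rules the authors used, so the word list and the function names below (`NON_HABITUAL_PRECEDERS`, `is_clearly_non_habitual`) are hypothetical, chosen only to illustrate the general technique.

```python
import re

# Hypothetical rule set (an assumption, not taken from the paper): words that,
# when they directly precede "be", signal an infinitival or modal use rather
# than habitual "be".
NON_HABITUAL_PRECEDERS = {
    "to", "will", "would", "can", "could", "shall",
    "should", "may", "might", "must", "gonna", "wanna", "gotta",
}


def tokenize(sentence):
    """Lowercase the sentence and split it into simple word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def is_clearly_non_habitual(sentence):
    """Return True if the sentence contains "be" and every occurrence
    directly follows a modal or "to"; such sentences can be labeled
    non-habitual by rule. Everything else is left for a classifier."""
    tokens = tokenize(sentence)
    positions = [i for i, tok in enumerate(tokens) if tok == "be"]
    if not positions:
        return False
    return all(
        i > 0 and tokens[i - 1] in NON_HABITUAL_PRECEDERS for i in positions
    )


if __name__ == "__main__":
    for s in ["She will be late", "He be working late", "I want to be there"]:
        print(s, "->", is_clearly_non_habitual(s))
```

A filter of this kind only removes the easy negatives; sentences such as "He be working late" remain ambiguous candidates that a trained classifier would still have to resolve.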
