Investigating Lexical Replacements for Arabic-English Code-Switched Data Augmentation

Paper title

Investigating Lexical Replacements for Arabic-English Code-Switched Data Augmentation

Authors

Injy Hamed, Nizar Habash, Slim Abdennadher, Ngoc Thang Vu

Abstract

Data sparsity is a main problem hindering the development of code-switching (CS) NLP systems. In this paper, we investigate data augmentation techniques for synthesizing dialectal Arabic-English CS text. We perform lexical replacements using word-aligned parallel corpora where CS points are either randomly chosen or learnt using a sequence-to-sequence model. We compare these approaches against dictionary-based replacements. We assess the quality of the generated sentences through human evaluation and evaluate the effectiveness of data augmentation on machine translation (MT), automatic speech recognition (ASR), and speech translation (ST) tasks. Results show that using a predictive model results in more natural CS sentences compared to the random approach, as reported in human judgements. In the downstream tasks, despite the random approach generating more data, both approaches perform equally (outperforming dictionary-based replacements). Overall, data augmentation achieves a 34% improvement in perplexity, a 5.2% relative improvement in WER on the ASR task, +4.0-5.1 BLEU points on the MT task, and +2.1-2.2 BLEU points on the ST task over a baseline trained on the available data without augmentation.
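
The random variant of the lexical replacement described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `replace_random`, the `replace_prob` parameter, the alignment format (a list of source-index/target-index pairs), and the placeholder tokens are all assumptions made for illustration.

```python
import random


def replace_random(src_tokens, tgt_tokens, alignment, replace_prob=0.2, rng=None):
    """Swap randomly chosen source tokens for their aligned target tokens.

    src_tokens:   tokens of the source (e.g. Arabic) sentence
    tgt_tokens:   tokens of the parallel target (e.g. English) sentence
    alignment:    list of (src_index, tgt_index) word-alignment pairs (assumed format)
    replace_prob: chance of switching at each aligned source token (hypothetical knob)
    """
    rng = rng or random.Random()

    # Group the aligned target indices under each source index.
    aligned = {}
    for s, t in alignment:
        aligned.setdefault(s, []).append(t)

    out = []
    for i, tok in enumerate(src_tokens):
        if i in aligned and rng.random() < replace_prob:
            # Code-switch point: emit the aligned target words in target order.
            out.extend(tgt_tokens[j] for j in sorted(aligned[i]))
        else:
            # Unaligned or not selected: keep the source token.
            out.append(tok)
    return out


# Placeholder tokens stand in for a real word-aligned sentence pair.
src = ["s0", "s1", "s2"]
tgt = ["t0", "t1", "t2"]
align = [(0, 0), (2, 1), (2, 2)]
always = replace_random(src, tgt, align, replace_prob=1.0)
never = replace_random(src, tgt, align, replace_prob=0.0)
```

Under the same assumptions, a learnt variant would replace the `rng.random() < replace_prob` draw with a per-token decision from a trained model, and a dictionary-based variant would look replacements up in a bilingual lexicon instead of a word alignment.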
