Paper Title

Unsupervised Bitext Mining and Translation via Self-trained Contextual Embeddings

Paper Authors

Phillip Keung, Julian Salazar, Yichao Lu, Noah A. Smith

Paper Abstract

We describe an unsupervised method to create pseudo-parallel corpora for machine translation (MT) from unaligned text. We use multilingual BERT to create source and target sentence embeddings for nearest-neighbor search and adapt the model via self-training. We validate our technique by extracting parallel sentence pairs on the BUCC 2017 bitext mining task and observe up to a 24.5 point increase (absolute) in F1 scores over previous unsupervised methods. We then improve an XLM-based unsupervised neural MT system pre-trained on Wikipedia by supplementing it with pseudo-parallel text mined from the same corpus, boosting unsupervised translation performance by up to 3.5 BLEU on the WMT'14 French-English and WMT'16 German-English tasks and outperforming the previous state-of-the-art. Finally, we enrich the IWSLT'15 English-Vietnamese corpus with pseudo-parallel Wikipedia sentence pairs, yielding a 1.2 BLEU improvement on the low-resource MT task. We demonstrate that unsupervised bitext mining is an effective way of augmenting MT datasets and complements existing techniques like initializing with pre-trained contextual embeddings.
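
The abstract describes mining pseudo-parallel sentence pairs by embedding source and target sentences with multilingual BERT and matching them through nearest-neighbor search. The sketch below is a minimal illustration of that retrieval step only, assuming mean-pooled `bert-base-multilingual-cased` embeddings and plain cosine similarity; the paper's layer selection, pair scoring, and self-training adaptation are not reproduced here, and the sentences are hypothetical examples.

```python
# Minimal sketch (not the authors' exact pipeline): embed sentences with
# multilingual BERT and pair each source sentence with its nearest target
# sentence by cosine similarity.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-multilingual-cased"  # assumed checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(sentences):
    """Mean-pool the final hidden states into one unit-length vector per sentence."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state        # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1)         # (batch, seq_len, 1)
    summed = (hidden * mask).sum(dim=1)                  # ignore padding positions
    counts = mask.sum(dim=1).clamp(min=1)
    return torch.nn.functional.normalize(summed / counts, dim=-1)

# Toy unaligned corpora (hypothetical examples).
src_sents = ["The cat sits on the mat.", "I like reading books."]
tgt_sents = ["J'aime lire des livres.", "Le chat est assis sur le tapis."]

src_emb, tgt_emb = embed(src_sents), embed(tgt_sents)
scores = src_emb @ tgt_emb.T          # cosine similarity matrix (src x tgt)
best = scores.argmax(dim=1)           # nearest target sentence for each source sentence

for i, j in enumerate(best.tolist()):
    print(f"{src_sents[i]!r}  <->  {tgt_sents[j]!r}  (score={scores[i, j].item():.3f})")
```

In the paper, pairs retrieved this way would then be filtered and used to fine-tune the encoder (self-training) and to supplement MT training data; this sketch stops at the raw nearest-neighbor retrieval.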
