Paper Title

Unsupervised Bitext Mining and Translation via Self-trained Contextual Embeddings

Paper Authors

Phillip Keung, Julian Salazar, Yichao Lu, Noah A. Smith

Paper Abstract

We describe an unsupervised method to create pseudo-parallel corpora for machine translation (MT) from unaligned text. We use multilingual BERT to create source and target sentence embeddings for nearest-neighbor search and adapt the model via self-training. We validate our technique by extracting parallel sentence pairs on the BUCC 2017 bitext mining task and observe up to a 24.5 point increase (absolute) in F1 scores over previous unsupervised methods. We then improve an XLM-based unsupervised neural MT system pre-trained on Wikipedia by supplementing it with pseudo-parallel text mined from the same corpus, boosting unsupervised translation performance by up to 3.5 BLEU on the WMT'14 French-English and WMT'16 German-English tasks and outperforming the previous state-of-the-art. Finally, we enrich the IWSLT'15 English-Vietnamese corpus with pseudo-parallel Wikipedia sentence pairs, yielding a 1.2 BLEU improvement on the low-resource MT task. We demonstrate that unsupervised bitext mining is an effective way of augmenting MT datasets and complements existing techniques like initializing with pre-trained contextual embeddings.
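
The abstract describes mining pseudo-parallel sentence pairs by embedding source and target sentences with multilingual BERT and matching them through nearest-neighbor search. The sketch below is a minimal illustration of that retrieval step only, assuming mean-pooled `bert-base-multilingual-cased` embeddings and plain cosine similarity; the paper's layer selection, pair scoring, and self-training adaptation are not reproduced here, and the sentences are hypothetical examples.

```python
# Minimal sketch (not the authors' exact pipeline): embed sentences with
# multilingual BERT and pair each source sentence with its nearest target
# sentence by cosine similarity.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-multilingual-cased"  # assumed checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(sentences):
    """Mean-pool the final hidden states into one unit-length vector per sentence."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state        # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1)         # (batch, seq_len, 1)
    summed = (hidden * mask).sum(dim=1)                  # ignore padding positions
    counts = mask.sum(dim=1).clamp(min=1)
    return torch.nn.functional.normalize(summed / counts, dim=-1)

# Toy unaligned corpora (hypothetical examples).
src_sents = ["The cat sits on the mat.", "I like reading books."]
tgt_sents = ["J'aime lire des livres.", "Le chat est assis sur le tapis."]

src_emb, tgt_emb = embed(src_sents), embed(tgt_sents)
scores = src_emb @ tgt_emb.T          # cosine similarity matrix (src x tgt)
best = scores.argmax(dim=1)           # nearest target sentence for each source sentence

for i, j in enumerate(best.tolist()):
    print(f"{src_sents[i]!r}  <->  {tgt_sents[j]!r}  (score={scores[i, j].item():.3f})")
```

In the paper, pairs retrieved this way would then be filtered and used to fine-tune the encoder (self-training) and to supplement MT training data; this sketch stops at the raw nearest-neighbor retrieval.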
