Paper Title

Multilingual Translation with Extensible Multilingual Pretraining and Finetuning

Paper Authors

Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan

Paper Abstract

Recent work demonstrates the potential of multilingual pretraining of creating one model that can be used for various tasks in different languages. Previous work in multilingual pretraining has demonstrated that machine translation systems can be created by finetuning on bitext. In this work, we show that multilingual translation models can be created through multilingual finetuning. Instead of finetuning on one direction, a pretrained model is finetuned on many directions at the same time. Compared to multilingual models trained from scratch, starting from pretrained models incorporates the benefits of large quantities of unlabeled monolingual data, which is particularly important for low resource languages where bitext is not available. We demonstrate that pretrained models can be extended to incorporate additional languages without loss of performance. We double the number of languages in mBART to support multilingual machine translation models of 50 languages. Finally, we create the ML50 benchmark, covering low, mid, and high resource languages, to facilitate reproducible research by standardizing training and evaluation data. On ML50, we demonstrate that multilingual finetuning improves on average 1 BLEU over the strongest baselines (being either multilingual from scratch or bilingual finetuning) while improving 9.3 BLEU on average over bilingual baselines from scratch.
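To make the multilingual-finetuning recipe concrete, below is a minimal sketch in Python using the HuggingFace `transformers` release of the pretrained mBART-50 checkpoint (`facebook/mbart-large-50`). The toy bitext, the two directions, and the hyperparameters are illustrative assumptions, not the paper's actual ML50 training setup; the point is only that a single pretrained model is finetuned on batches drawn from many translation directions at once.

```python
# Minimal sketch of multilingual finetuning: one pretrained mBART-50 model is
# finetuned on sentence pairs sampled from several directions in the same loop,
# rather than training a separate bilingual model per direction.
# The toy bitext and hyperparameters are placeholders, not the ML50 setup.
import random

import torch
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

# Toy parallel data for two directions into English (hypothetical examples).
directions = {
    ("de_DE", "en_XX"): [("maschinelle Übersetzung", "machine translation")],
    ("ne_NP", "en_XX"): [("मेसिन अनुवाद", "machine translation")],
}

model.train()
for step in range(100):
    # Sample a direction, then a sentence pair from that direction.
    (src_lang, tgt_lang), pairs = random.choice(list(directions.items()))
    src_text, tgt_text = random.choice(pairs)

    # The language codes control which language tokens the tokenizer prepends.
    tokenizer.src_lang = src_lang
    tokenizer.tgt_lang = tgt_lang
    batch = tokenizer(src_text, text_target=tgt_text, return_tensors="pt")

    loss = model(**batch).loss  # cross-entropy over the target tokens
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The many-to-many checkpoints released alongside the paper (e.g. `facebook/mbart-large-50-many-to-many-mmt`) can then be used directly for translation by setting the decoder's forced start token to the desired target-language code at generation time.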
