Paper Title

Tencent AI Lab - Shanghai Jiao Tong University Low-Resource Translation System for the WMT22 Translation Task

Authors

Zhiwei He, Xing Wang, Zhaopeng Tu, Shuming Shi, Rui Wang

Abstract

This paper describes the Tencent AI Lab - Shanghai Jiao Tong University (TAL-SJTU) low-resource translation systems for the WMT22 shared task. We participate in the general translation task on English$\Leftrightarrow$Livonian. Our system is based on M2M100, with novel techniques that adapt it to the target language pair. (1) Cross-model word embedding alignment: inspired by cross-lingual word embedding alignment, we successfully transfer a pre-trained word embedding to M2M100, enabling it to support Livonian. (2) Gradual adaptation strategy: we exploit Estonian and Latvian as auxiliary languages for many-to-many translation training and then adapt the model to English-Livonian. (3) Data augmentation: to enlarge the parallel data for English-Livonian, we construct pseudo-parallel data with Estonian and Latvian as pivot languages. (4) Fine-tuning: to make the most of all available data, we fine-tune the model with the validation set and online back-translation, further boosting performance. In model evaluation: (1) We find that previous work underestimated the translation performance on Livonian due to inconsistent Unicode normalization, which may cause a discrepancy of up to 14.9 BLEU points. (2) In addition to the standard validation set, we also employ round-trip BLEU to evaluate the models, which we find more appropriate for this task. Finally, our unconstrained system achieves BLEU scores of 17.0 and 30.4 for English to/from Livonian.
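The Unicode normalization issue mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation script: the sample phrase, the helper names `normalize` and `token_overlap`, and the choice of NFC as the common form are all assumptions made for illustration, and the toy overlap metric stands in for BLEU only to show the effect. The same visible Livonian text, encoded once in composed form (NFC) and once in decomposed form (NFD), shares no tokens until both sides are normalized the same way.

```python
import unicodedata


def normalize(text: str, form: str = "NFC") -> str:
    """Bring text into one Unicode normalization form (NFC is an assumed choice)."""
    return unicodedata.normalize(form, text)


def token_overlap(hypothesis: str, reference: str) -> float:
    """Toy metric (not BLEU): fraction of hypothesis tokens found in the reference."""
    hyp_tokens = hypothesis.split()
    ref_tokens = set(reference.split())
    if not hyp_tokens:
        return 0.0
    return sum(tok in ref_tokens for tok in hyp_tokens) / len(hyp_tokens)


# Illustrative phrase with diacritics; both strings render identically on screen.
sample = "līvõ kēļ"
reference = unicodedata.normalize("NFC", sample)   # composed code points
hypothesis = unicodedata.normalize("NFD", sample)  # base letters + combining marks

# Without consistent normalization, the strings differ at the code-point level.
raw_overlap = token_overlap(hypothesis, reference)

# After normalizing both sides to the same form, the tokens match again.
fixed_overlap = token_overlap(normalize(hypothesis), normalize(reference))

print(f"raw overlap:   {raw_overlap:.1f}")
print(f"fixed overlap: {fixed_overlap:.1f}")
```

Any n-gram metric computed on the unnormalized pair is penalized in the same way, which is why applying one normalization form to both hypotheses and references before scoring matters for languages with many diacritics.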
