Paper Title

Understanding and Improving Lexical Choice in Non-Autoregressive Translation

Authors

Liang Ding, Longyue Wang, Xuebo Liu, Derek F. Wong, Dacheng Tao, Zhaopeng Tu

Abstract

Knowledge distillation (KD) is essential for training non-autoregressive translation (NAT) models, as it reduces the complexity of the raw data with an autoregressive teacher model. In this study, we empirically show that, as a side effect of this training, lexical choice errors on low-frequency words are propagated from the teacher model to the NAT model. To alleviate this problem, we propose to expose the raw data to NAT models to restore the useful information about low-frequency words that is missed in the distilled data. To this end, we introduce an extra Kullback-Leibler divergence term derived by comparing the lexical choice of the NAT model with that embedded in the raw data. Experimental results across language pairs and model architectures demonstrate the effectiveness and universality of the proposed approach. Extensive analyses confirm our claim that our approach improves performance by reducing lexical choice errors on low-frequency words. Encouragingly, our approach pushes the SOTA NAT performance on the WMT14 English-German and WMT16 Romanian-English datasets up to 27.8 and 33.8 BLEU points, respectively. The source code will be released.
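
As a rough sketch of the regularizer described in the abstract (the abstract does not give the exact formulation, so the symbols Q_raw, P_theta, and the weight lambda below are illustrative names rather than the paper's notation), the training objective can be pictured as the usual NAT loss on the distilled data plus a KL term that pulls the model's lexical-choice distribution toward one estimated from the raw parallel data:

% Illustrative only: Q_raw denotes a lexical-choice distribution estimated from
% the raw parallel data, P_theta the NAT model's predictive distribution, and
% lambda a weight on the extra prior term (names assumed, not from the paper).
\mathcal{L}(\theta)
  = \mathcal{L}_{\mathrm{NAT}}(\theta)
  + \lambda \, \mathrm{KL}\!\bigl( Q_{\mathrm{raw}} \,\big\|\, P_{\theta} \bigr)

Intuitively, the first term is the standard KD objective on the distilled data, while the second term re-exposes the model to lexical statistics of the raw data, which is where the abstract claims the low-frequency-word information resides.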
