Paper Title
Conditional Bilingual Mutual Information Based Adaptive Training for Neural Machine Translation
Paper Authors
Paper Abstract
Token-level adaptive training approaches can alleviate the token imbalance problem and thus improve neural machine translation by re-weighting the losses of different target tokens based on specific statistical metrics (e.g., token frequency or mutual information). Given that standard translation models make predictions conditioned on previous target contexts, we argue that the above statistical metrics ignore target context information and may assign inappropriate weights to target tokens. While one possible solution is to directly incorporate target contexts into these statistical metrics, such target-context-aware statistical computation is extremely expensive, and the corresponding storage overhead is unrealistic. To solve these issues, we propose a target-context-aware metric, named conditional bilingual mutual information (CBMI), which makes it feasible to supplement statistical metrics with target context information. In particular, our CBMI can be formalized as the log quotient of the translation model probability and the language model probability by decomposing the conditional joint distribution. Thus, CBMI can be efficiently calculated during model training without any pre-specified statistical calculations or large storage overhead. Furthermore, we propose an effective adaptive training approach based on both token- and sentence-level CBMI. Experimental results on the WMT14 English-German and WMT19 Chinese-English tasks show that our approach significantly outperforms the Transformer baseline and other related methods.
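The "log quotient" formulation in the abstract can be made concrete. Below is a sketch of the decomposition, assuming $x$ denotes the source sentence, $y_t$ the target token at step $t$, and $y_{<t}$ the preceding target context (the notation is mine, reconstructed from the abstract rather than taken from the paper):

```latex
\mathrm{CBMI}(x; y_t)
  = \log \frac{p(x, y_t \mid y_{<t})}{p(x \mid y_{<t})\, p(y_t \mid y_{<t})}
  = \log \frac{p(y_t \mid x, y_{<t})}{p(y_t \mid y_{<t})}
  \approx \log \frac{p_{\mathrm{TM}}(y_t \mid x, y_{<t})}{p_{\mathrm{LM}}(y_t \mid y_{<t})}
```

That is, the conditional pointwise mutual information between the source sentence and a target token reduces to the log ratio of the translation model probability to a target-side language model probability, both of which are available during training.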
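Since CBMI only needs probabilities that are already produced in the forward pass, the loss re-weighting can be done on the fly. Here is a minimal PyTorch sketch, assuming a translation model and an auxiliary target-side language model are trained jointly; the function name `cbmi_weighted_loss`, the scaling hyperparameter `k`, and the batch-level normalization are illustrative choices of mine, not necessarily the paper's exact scheme:

```python
import torch
import torch.nn.functional as F

def cbmi_weighted_loss(tm_logits, lm_logits, targets, pad_id=0, k=1.0):
    """Re-weight token-level cross-entropy by CBMI (illustrative sketch).

    tm_logits: (batch, seq, vocab) logits from the translation model p(y_t | x, y_<t)
    lm_logits: (batch, seq, vocab) logits from a target-side language model p(y_t | y_<t)
    targets:   (batch, seq) gold target token ids
    """
    tm_logp = F.log_softmax(tm_logits, dim=-1)
    lm_logp = F.log_softmax(lm_logits, dim=-1)

    # Gather the log-probabilities of the gold tokens.
    gold = targets.unsqueeze(-1)
    tm_gold = tm_logp.gather(-1, gold).squeeze(-1)   # log p_TM(y_t | x, y_<t)
    lm_gold = lm_logp.gather(-1, gold).squeeze(-1)   # log p_LM(y_t | y_<t)

    # Token-level CBMI: the log quotient of the two model probabilities.
    cbmi = tm_gold - lm_gold                         # (batch, seq)

    mask = targets.ne(pad_id).float()

    # Hypothetical weighting scheme: standardize CBMI over non-pad tokens and
    # shift so the mean weight is 1 (the paper's exact scheme may differ).
    mean = (cbmi * mask).sum() / mask.sum()
    std = (((cbmi - mean) ** 2 * mask).sum() / mask.sum()).sqrt()
    weights = (1.0 + k * (cbmi - mean) / (std + 1e-6)).detach().clamp(min=0.0)

    nll = -tm_gold                                   # per-token cross-entropy
    return (weights * nll * mask).sum() / mask.sum()
```

Note that the weights are detached from the computation graph, so CBMI steers the loss scale without itself receiving gradients; a sentence-level variant could analogously average the token-level CBMI over each sentence before re-weighting.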