Paper title
Inference-only sub-character decomposition improves translation of unseen logographic characters
Paper authors
Paper abstract
Neural Machine Translation (NMT) on logographic source languages struggles to translate 'unseen' characters, that is, characters that never appear in the training data. One possible approach to this problem applies sub-character decomposition to both training and test sentences. However, this approach requires complete retraining, and its effectiveness for translating unseen characters into non-logographic languages has not been fully explored. We investigate existing ideograph-based sub-character decomposition approaches for Chinese-to-English and Japanese-to-English NMT, in both high-resource and low-resource domains. For each language pair and domain, we construct a test set in which every source sentence contains at least one unseen logographic character. We find that complete sub-character decomposition often harms unseen character translation and gives inconsistent results overall. We offer a simple alternative: decomposition applied before inference, and to unseen characters only. Our approach can be applied flexibly, improves translation adequacy, and requires no additional models or training.
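The inference-only idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy decomposition table, and the choice to keep the structural marker (⿰) in the output are all assumptions made for illustration. A real system would load a full ideographic decomposition resource and pass the rewritten sentence to an already trained NMT model.

```python
from typing import Dict, Iterable, Set

# Toy decomposition table (illustrative assumption, not the paper's resource).
# Each entry maps a character to an ideographic description sequence.
TOY_DECOMPOSITIONS: Dict[str, str] = {
    "好": "⿰女子",
    "明": "⿰日月",
    "林": "⿰木木",
}


def build_seen_characters(training_sentences: Iterable[str]) -> Set[str]:
    """Collect every character that occurs in the training source text."""
    return {ch for sentence in training_sentences for ch in sentence}


def decompose_unseen(sentence: str, seen: Set[str], table: Dict[str, str]) -> str:
    """Replace only unseen characters with their decomposition.

    Seen characters, and unseen characters without a table entry,
    are left unchanged.
    """
    pieces = []
    for ch in sentence:
        if ch not in seen and ch in table:
            pieces.append(table[ch])
        else:
            pieces.append(ch)
    return "".join(pieces)


# Hypothetical training data: the model has seen the components but not
# the composed characters.
seen = build_seen_characters(["女子日月", "木日"])
print(decompose_unseen("明日好", seen, TOY_DECOMPOSITIONS))
```

Because the rewrite happens only on the test-time input, the trained model and its vocabulary are untouched, which is what makes the approach applicable without retraining.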