Paper title

Inference-only sub-character decomposition improves translation of unseen logographic characters

Authors

Danielle Saunders, Weston Feely, Bill Byrne

Abstract

Neural Machine Translation (NMT) on logographic source languages struggles when translating 'unseen' characters, which never appear in the training data. One possible approach to this problem uses sub-character decomposition for training and test sentences. However, this approach involves complete retraining, and its effectiveness for unseen character translation to non-logographic languages has not been fully explored. We investigate existing ideograph-based sub-character decomposition approaches for Chinese-to-English and Japanese-to-English NMT, for both high-resource and low-resource domains. For each language pair and domain we construct a test set where all source sentences contain at least one unseen logographic character. We find that complete sub-character decomposition often harms unseen character translation, and gives inconsistent results generally. We offer a simple alternative based on decomposition before inference for unseen characters only. Our approach allows flexible application, achieving translation adequacy improvements and requiring no additional models or training.
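The inference-only idea in the abstract (decompose only the unseen characters, just before translation) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: the function `decompose_unseen`, the two-entry `TOY_IDS` table, and the choice to keep the ideographic description character (such as ⿰) in the output are hypothetical and are not taken from the paper. A real system would load a full decomposition table and might treat the resulting pieces differently.

```python
# Toy table mapping a character to an ideographic description sequence.
# Illustrative only: a real system would load a complete decomposition table.
TOY_IDS = {
    "明": "⿰日月",
    "好": "⿰女子",
}


def decompose_unseen(sentence, train_chars, ids_table):
    """Decompose only the characters that never appeared in training data.

    Characters that were seen in training, or that have no entry in the
    decomposition table, are passed through unchanged.
    """
    pieces = []
    for ch in sentence:
        if ch not in train_chars and ch in ids_table:
            pieces.append(ids_table[ch])  # unseen: swap in its components
        else:
            pieces.append(ch)  # seen (or not decomposable): keep as-is
    return "".join(pieces)


# Hypothetical set of characters observed in the training corpus.
train_chars = set("你日月女子")

# "好" is unseen here, so only it is decomposed before inference.
print(decompose_unseen("你好", train_chars, TOY_IDS))
```

Because the rewrite happens on the source sentence before it reaches the translation model, this kind of preprocessing needs no retraining, which matches the abstract's claim that the approach requires no additional models or training.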
