Inference-only sub-character decomposition improves translation of unseen logographic characters

Paper title

Inference-only sub-character decomposition improves translation of unseen logographic characters

Authors

Danielle Saunders, Weston Feely, Bill Byrne

Abstract

Neural Machine Translation (NMT) on logographic source languages struggles when translating 'unseen' characters, which never appear in the training data. One possible approach to this problem uses sub-character decomposition for training and test sentences. However, this approach involves complete retraining, and its effectiveness for unseen character translation to non-logographic languages has not been fully explored. We investigate existing ideograph-based sub-character decomposition approaches for Chinese-to-English and Japanese-to-English NMT, for both high-resource and low-resource domains. For each language pair and domain we construct a test set where all source sentences contain at least one unseen logographic character. We find that complete sub-character decomposition often harms unseen character translation, and gives inconsistent results generally. We offer a simple alternative based on decomposition before inference for unseen characters only. Our approach allows flexible application, achieving translation adequacy improvements and requiring no additional models or training.
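
The idea of decomposing only unseen characters before inference can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the character-level notion of "seen", and the tiny decomposition table are all assumptions made for illustration. A real system would presumably load a full ideographic decomposition resource and might apply further preprocessing.

```python
from typing import Dict, Iterable, Set

# Toy decomposition table (illustrative assumption, not the paper's data):
# maps a character to an ideographic description sequence of its components.
TOY_DECOMPOSITIONS: Dict[str, str] = {
    "好": "⿰女子",
    "明": "⿰日月",
}


def build_char_vocab(training_sentences: Iterable[str]) -> Set[str]:
    """Collect every character that appears in the training source text."""
    return {ch for sentence in training_sentences for ch in sentence}


def decompose_unseen(sentence: str, seen_chars: Set[str],
                     table: Dict[str, str]) -> str:
    """Replace only unseen characters that have a known decomposition.

    Seen characters, and unseen characters missing from the table,
    are passed through unchanged.
    """
    pieces = []
    for ch in sentence:
        if ch not in seen_chars and ch in table:
            pieces.append(table[ch])
        else:
            pieces.append(ch)
    return "".join(pieces)


if __name__ == "__main__":
    seen = build_char_vocab(["今日は明るい"])
    # "明" was seen in training, so it is kept whole; "好" was not, so it is decomposed.
    print(decompose_unseen("明日は好き", seen, TOY_DECOMPOSITIONS))
```

Because the substitution happens purely on the input side at inference time, a sketch like this leaves the trained model and its training data untouched, which matches the abstract's claim that no additional models or training are required.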
