Morphological Word Segmentation on Agglutinative Languages for Neural Machine Translation

Paper Title

Morphological Word Segmentation on Agglutinative Languages for Neural Machine Translation

Authors

Pan, Yirong, Li, Xiao, Yang, Yating, Dong, Rui

Abstract

Neural machine translation (NMT) has achieved impressive performance on machine translation tasks in recent years. However, for efficiency, a limited-size vocabulary that contains only the top-N highest-frequency words is employed for model training, which leads to many rare and unknown words. The problem is especially severe when translating from low-resource, morphologically rich agglutinative languages, which have complex morphology and large vocabularies. In this paper, we propose a morphological word segmentation method on the source side for NMT that incorporates morphological knowledge to preserve the linguistic and semantic information in the word structure while reducing the vocabulary size at training time. It can also be used as a preprocessing tool to segment words in agglutinative languages for other natural language processing (NLP) tasks. Experimental results show that our morphologically motivated word segmentation method is better suited to the NMT model: it achieves significant improvements on Turkish-English and Uyghur-Chinese machine translation tasks by reducing data sparseness and language complexity.
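To make the vocabulary argument concrete, here is a minimal Python sketch of how splitting agglutinative words into a stem plus suffix tokens can shrink the vocabulary and cover unseen word forms. It is an illustration only, not the segmentation method proposed in the paper: the suffix list, the `##` marker for suffix tokens, the `min_stem` threshold, and the toy Turkish word lists are all assumptions made for this example, and a real system would rely on a proper morphological analyzer.

```python
# Toy illustration only: NOT the paper's segmentation method.
# Hypothetical suffix inventory (Turkish plural and locative forms).
SUFFIXES = ["ler", "lar", "de", "da"]


def segment(word, suffixes=SUFFIXES, min_stem=2):
    """Split a word into a stem followed by suffix tokens.

    Known suffixes are stripped from the right, longest first, as long as
    at least `min_stem` characters of stem remain. Suffix tokens carry a
    "##" prefix (an assumed convention) so they stay distinct from stems.
    """
    ordered = sorted(suffixes, key=len, reverse=True)
    tail = []
    stripped = True
    while stripped:
        stripped = False
        for suf in ordered:
            if word.endswith(suf) and len(word) - len(suf) >= min_stem:
                tail.insert(0, "##" + suf)
                word = word[: -len(suf)]
                stripped = True
                break
    return [word] + tail


# Hypothetical training words and held-out (unseen) word forms.
train_words = ["ev", "evler", "evde", "kitap", "kitaplar",
               "okul", "okulda", "okullarda"]
held_out = ["evlerde", "kitaplarda"]

# Word-level vocabulary versus stem-plus-suffix vocabulary.
word_vocab = set(train_words)
sub_vocab = {tok for w in train_words for tok in segment(w)}

print("word-level vocabulary size:", len(word_vocab))
print("segmented vocabulary size: ", len(sub_vocab))

for w in held_out:
    pieces = segment(w)
    covered = all(p in sub_vocab for p in pieces)
    print(w, "->", pieces,
          "| known as a whole word:", w in word_vocab,
          "| all pieces known:", covered)
```

In this toy setting the held-out forms are unknown at the word level, but every piece they split into already appears in the segmented vocabulary, which is the kind of reduction in data sparseness the abstract describes.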
