Improving Multilingual Neural Machine Translation System for Indic Languages

Paper Title

Improving Multilingual Neural Machine Translation System for Indic Languages

Authors

Sudhansu Bala Das, Atharv Biradar, Tapas Kumar Mishra, Bidyut Kumar Patra

Abstract

A machine translation system (MTS) serves as an effective tool for communication by translating text or speech from one language to another. The need for an efficient translation system becomes obvious in a large multilingual environment like India, where English and a set of Indian languages (ILs) are officially used. In contrast with English, ILs are still treated as low-resource languages due to the unavailability of corpora. To address this asymmetry, multilingual neural machine translation (MNMT) has evolved as an ideal approach. In this paper, we propose an MNMT system to address the issues related to low-resource language translation. Our model comprises two MNMT systems, one for English-Indic (one-to-many) and the other for Indic-English (many-to-one), with a shared encoder-decoder covering 15 language pairs (30 translation directions). Since most IL pairs have only a scanty amount of parallel corpora, which is not sufficient for training a machine translation model, we explore various augmentation strategies to improve overall translation quality through the proposed model. A state-of-the-art transformer architecture is used to realize the proposed model. Experiments over a good amount of data reveal its superiority over conventional models. In addition, the paper addresses the use of language relationships (in terms of dialect, script, etc.), particularly the role of high-resource languages of the same family in boosting the performance of low-resource languages. Moreover, the experimental results show the advantage of backtranslation and domain adaptation for ILs in enhancing the translation quality of both source and target languages. Using all these key approaches, our proposed model proves more efficient than the baseline model in terms of the evaluation metric, i.e., the BLEU (BiLingual Evaluation Understudy) score, for a set of ILs.
