Paper Title

Central Yup'ik and Machine Translation of Low-Resource Polysynthetic Languages

Authors

Christopher Liu, Laura Dominé, Kevin Chavez, Richard Socher

Abstract

Machine translation tools do not yet exist for the Yup'ik language, a polysynthetic language spoken by around 8,000 people who live primarily in Southwest Alaska. We compiled a parallel text corpus for Yup'ik and English and developed a morphological parser for Yup'ik based on grammar rules. We trained a seq2seq neural machine translation model with attention to translate Yup'ik input into English. We then compared the influence of different tokenization methods, namely rule-based, unsupervised (byte pair encoding), and unsupervised morphological (Morfessor) parsing, on BLEU score accuracy for Yup'ik to English translation. We find that using tokenized input increases the translation accuracy compared to that of unparsed input. Although overall Morfessor did best with a vocabulary size of 30k, our first experiments show that BPE performed best with a reduced vocabulary size.
