Paper Title
Extending the Subwording Model of Multilingual Pretrained Models for New Languages
Paper Authors
Paper Abstract
Multilingual pretrained models are effective for machine translation and cross-lingual processing because they cover multiple languages in a single model. However, they are pretrained after their tokenizers are fixed; therefore, it is difficult to change the vocabulary after pretraining. When we extend a pretrained model to new languages, we must modify the tokenizer at the same time. In this paper, we add new subwords to the SentencePiece tokenizer to apply a multilingual pretrained model to a new language (Inuktitut in this paper). In our experiments, we segmented Inuktitut sentences into subwords without changing the segmentation of the already-pretrained languages, and applied the mBART-50 pretrained model to English-Inuktitut translation.
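As an illustration of the tokenizer-extension step the abstract describes, below is a minimal Python sketch using the Hugging Face transformers API. This is an approximation, not the authors' implementation: the paper adds subwords to the SentencePiece model itself so that the segmentation of already-pretrained languages is unchanged, whereas `add_tokens` registers the new subwords in a separate added-token table. The Inuktitut strings are hypothetical placeholders.

```python
# Minimal sketch (not the authors' code): extend mBART-50's vocabulary with
# new Inuktitut subwords and resize the embedding matrix to match.
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50")
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")

# Hypothetical subwords, e.g. learned by a separate SentencePiece run on
# Inuktitut text; add_tokens skips strings already in the vocabulary, so the
# segmentation of already-pretrained languages is untouched.
new_subwords = ["ᐃᓄᒃᑎᑐᑦ", "ᓄᓇᕗᑦ"]  # placeholder examples
num_added = tokenizer.add_tokens(new_subwords)

# The new embedding rows are randomly initialized and must be learned during
# fine-tuning, e.g. on English-Inuktitut parallel data.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} subwords; vocabulary size is now {len(tokenizer)}")
```

Resizing the embedding matrix is the essential companion step: without it, the newly added subword IDs would fall outside the model's embedding table and decoding would fail.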