Title
Does Transliteration Help Multilingual Language Modeling?
Authors
Abstract
Script diversity presents a challenge to Multilingual Language Models (MLLMs) by reducing lexical overlap among closely related languages. Therefore, transliterating closely related languages that use different writing scripts into a common script may improve the downstream task performance of MLLMs. We empirically measure the effect of transliteration on MLLMs in this context. We specifically focus on the Indic languages, which have the highest script diversity in the world, and we evaluate our models on the IndicGLUE benchmark. We perform the Mann-Whitney U test to rigorously verify whether the effect of transliteration is significant or not. We find that transliteration benefits the low-resource languages without negatively affecting the comparatively high-resource languages. We also measure the cross-lingual representation similarity of the models using centered kernel alignment on parallel sentences from the FLORES-101 dataset. We find that for parallel sentences across different languages, the transliteration-based model learns sentence representations that are more similar.
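The abstract names two analysis tools: the Mann-Whitney U test for comparing per-language scores, and (linear) centered kernel alignment (CKA) for comparing sentence representations. Below is a minimal sketch of both, not the authors' implementation; the score lists are invented placeholders, and `linear_cka` follows the standard linear-CKA formula for two representation matrices over the same examples.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def linear_cka(X, Y):
    """Linear CKA between representation matrices X (n x d1) and
    Y (n x d2) computed over the same n examples (e.g., parallel
    sentences encoded by two models or in two languages)."""
    # Center each feature dimension.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    den = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return num / den

# Hypothetical per-language task scores for a baseline model and a
# transliteration-based model (placeholder numbers, not paper results).
scores_baseline = [61.2, 58.7, 64.1, 55.3, 59.8]
scores_translit = [66.5, 63.0, 65.2, 60.9, 64.4]

# One-sided test: is the transliteration-based model's score
# distribution stochastically greater than the baseline's?
stat, p_value = mannwhitneyu(scores_translit, scores_baseline,
                             alternative="greater")
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of the representations, which makes it a convenient way to compare sentence embeddings produced by differently trained models.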