Paper Title
Inducing Language-Agnostic Multilingual Representations
Paper Authors
Paper Abstract
Cross-lingual representations have the potential to make NLP techniques available to the vast majority of languages in the world. However, they currently require large pretraining corpora or access to typologically similar languages. In this work, we address these obstacles by removing language identity signals from multilingual embeddings. We examine three approaches for this: (i) re-aligning the vector spaces of target languages (all together) to a pivot source language; (ii) removing language-specific means and variances, which yields better discriminativeness of embeddings as a by-product; and (iii) increasing input similarity across languages by removing morphological contractions and sentence reordering. We evaluate on XNLI and reference-free MT across 19 typologically diverse languages. Our findings expose the limitations of these approaches -- unlike vector normalization, vector space re-alignment and text normalization do not achieve consistent gains across encoders and languages. Due to the approaches' additive effects, their combination decreases the cross-lingual transfer gap by 8.9 points (m-BERT) and 18.2 points (XLM-R) on average across all tasks and languages, however. Our code and models are publicly available.
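For concreteness, approach (ii), removing language-specific means and variances, can be sketched as per-language standardization of sentence embeddings. The snippet below is a minimal illustration under that reading, not the paper's released code; the function name `normalize_by_language` and the exact scaling (dividing each dimension by its per-language standard deviation) are assumptions for illustration.

```python
import numpy as np

def normalize_by_language(embeddings, languages):
    """Per-language standardization of sentence embeddings (illustrative sketch).

    embeddings: (n, d) array of sentence embeddings (e.g. from m-BERT or XLM-R)
    languages:  length-n sequence of language codes, one per embedding
    Returns an (n, d) array where each language's embeddings have had their
    language-specific mean removed and per-dimension variance rescaled.
    """
    embeddings = np.asarray(embeddings, dtype=np.float64)
    languages = np.asarray(languages)
    out = np.empty_like(embeddings)
    for lang in np.unique(languages):
        mask = languages == lang
        vecs = embeddings[mask]
        mean = vecs.mean(axis=0)          # language-specific mean
        std = vecs.std(axis=0) + 1e-12    # language-specific std; epsilon avoids /0
        out[mask] = (vecs - mean) / std   # center and rescale within this language
    return out

# Hypothetical usage: normalize pooled encoder outputs for an English/German batch.
# vectors = encoder(sentences)            # shape (n, d), e.g. n sentences, d = 768
# normalized = normalize_by_language(vectors, ["en", "en", "de", "de"])
```

Centering alone removes the dominant "language identity" direction from each language's embedding cloud; the additional variance rescaling shown here is one plausible way to equalize the spread across languages as well.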