Isomorphic Cross-lingual Embeddings for Low-Resource Languages

Paper Title

Isomorphic Cross-lingual Embeddings for Low-Resource Languages

Authors

Sonal Sannigrahi, Jesse Read

Abstract

Cross-Lingual Word Embeddings (CLWEs) are a key component for transferring linguistic information learnt from higher-resource settings into lower-resource ones. Recent research in cross-lingual representation learning has focused on offline mapping approaches due to their simplicity, computational efficacy, and ability to work with minimal parallel resources. However, they crucially depend on the assumption that embedding spaces are approximately isomorphic, i.e., that they share a similar geometric structure. This assumption does not hold in practice, leading to poorer performance on low-resource and distant language pairs. In this paper, we introduce a framework to learn CLWEs, without assuming isometry, for low-resource pairs via joint exploitation of a related higher-resource language. In our work, we first pre-align the low-resource and related language embedding spaces using offline methods to mitigate the assumption of isometry. Following this, we use joint training methods to develop CLWEs for the related language and the target embedding space. Finally, we remap the pre-aligned low-resource space and the target space to generate the final CLWEs. We show consistent gains over current methods in both quality and degree of isomorphism, as measured by bilingual lexicon induction (BLI) and eigenvalue similarity respectively, across several language pairs: {Nepali, Finnish, Romanian, Gujarati, Hungarian}-English. Lastly, our analysis also points to the relatedness of the related language, as well as the amount of related-language data available, as key factors in determining the quality of the embeddings achieved.
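To make the "offline mapping" and BLI terms in the abstract concrete, here is a minimal sketch. The abstract does not name the offline method used, so orthogonal Procrustes is assumed purely for illustration: it is a common offline mapping, and it is one that relies on the isometry assumption the paper sets out to relax. The function names `procrustes_map` and `bli_precision_at_1` are hypothetical, and the data are synthetic toy vectors rather than real word embeddings.

```python
import numpy as np


def procrustes_map(src, tgt):
    """Orthogonal W minimising ||src @ W - tgt||_F.

    Rows of src and tgt are assumed to be embeddings of seed-dictionary
    pairs, aligned by row index (an assumption of this sketch).
    """
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


def bli_precision_at_1(src, tgt, W):
    """Fraction of source rows whose cosine-nearest target row is the gold row."""
    mapped = src @ W
    mapped = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    tgt_n = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    pred = (mapped @ tgt_n.T).argmax(axis=1)
    return float((pred == np.arange(len(src))).mean())


# Toy data: the target space is an exact rotation of the source space,
# i.e. the two spaces are perfectly isometric by construction.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))                   # "source" embeddings
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))   # random orthogonal matrix
Y = X @ Q                                        # "target" embeddings

W = procrustes_map(X, Y)
p1 = bli_precision_at_1(X, Y, W)
```

In this idealised setting the mapping recovers the rotation and retrieval is perfect. Real embedding spaces for distant or low-resource pairs are not exact rotations of one another, which is the failure case the abstract describes.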
