Paper Title


Expanding Pretrained Models to Thousands More Languages via Lexicon-based Adaptation

Paper Authors

Xinyi Wang, Sebastian Ruder, Graham Neubig

Abstract


The performance of multilingual pretrained models is highly dependent on the availability of monolingual or parallel text present in a target language. Thus, the majority of the world's languages cannot benefit from recent progress in NLP as they have no or limited textual data. To expand possibilities of using NLP technology in these under-represented languages, we systematically study strategies that relax the reliance on conventional language resources through the use of bilingual lexicons, an alternative resource with much better language coverage. We analyze different strategies to synthesize textual or labeled data using lexicons, and how this data can be combined with monolingual or parallel text when available. For 19 under-represented languages across 3 tasks, our methods lead to consistent improvements of up to 5 and 15 points with and without extra monolingual text respectively. Overall, our study highlights how NLP methods can be adapted to thousands more languages that are under-served by current technology.
