Paper Title

Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model

Authors

Juntao Li, Ruidan He, Hai Ye, Hwee Tou Ng, Lidong Bing, Rui Yan

Abstract

Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements over various cross-lingual and low-resource tasks. Through training on one hundred languages and terabytes of texts, cross-lingual language models have proven to be effective in leveraging high-resource languages to enhance low-resource language processing and outperform monolingual models. In this paper, we further investigate the cross-lingual and cross-domain (CLCD) setting when a pretrained cross-lingual language model needs to adapt to new domains. Specifically, we propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features and domain-invariant features from the entangled pretrained cross-lingual representations, given unlabeled raw texts in the source language. Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts. Experimental results show that our proposed method achieves significant performance improvements over the state-of-the-art pretrained cross-lingual language model in the CLCD setting. The source code of this paper is publicly available at https://github.com/lijuntaopku/UFD.
