Paper title
Geographic Adaptation of Pretrained Language Models
Paper authors
Paper abstract
While pretrained language models (PLMs) have been shown to possess a plethora of linguistic knowledge, the existing body of research has largely neglected extralinguistic knowledge, which is generally difficult to obtain by pretraining on text alone. Here, we contribute to closing this gap by examining geolinguistic knowledge, i.e., knowledge about geographic variation in language. We introduce geoadaptation, an intermediate training step that couples language modeling with geolocation prediction in a multi-task learning setup. We geoadapt four PLMs, covering language groups from three geographic areas, and evaluate them on five different tasks: fine-tuned (i.e., supervised) geolocation prediction, zero-shot (i.e., unsupervised) geolocation prediction, fine-tuned language identification, zero-shot language identification, and zero-shot prediction of dialect features. Geoadaptation is very successful at injecting geolinguistic knowledge into the PLMs: the geoadapted PLMs consistently outperform PLMs adapted using only language modeling (by especially wide margins on zero-shot prediction tasks), and we obtain new state-of-the-art results on two benchmarks for geolocation prediction and language identification. Furthermore, we show that the effectiveness of geoadaptation stems from its ability to geographically retrofit the representation space of the PLMs.
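The coupling of language modeling with geolocation prediction described in the abstract can be illustrated with a minimal sketch of a joint objective. The abstract does not specify how the two losses are combined, so everything here is an assumption made for illustration: the function names (`cross_entropy`, `geo_loss`, `joint_loss`), the fixed `geo_weight` parameter, and the choice of mean squared error over (longitude, latitude) pairs. The paper's actual formulation may differ.

```python
import math
from typing import Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def geo_loss(pred: Sequence[float], gold: Sequence[float]) -> float:
    """Mean squared error between predicted and gold coordinates.

    Assumption: coordinates are a (longitude, latitude) pair; the
    squared-error form is illustrative, not taken from the paper.
    """
    return sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(gold)


def joint_loss(
    mlm_logits: Sequence[Sequence[float]],
    mlm_targets: Sequence[int],
    pred_coords: Sequence[float],
    gold_coords: Sequence[float],
    geo_weight: float = 1.0,  # hypothetical fixed task weight
) -> float:
    """Hypothetical multi-task objective for one example:
    mean masked-token loss plus a weighted geolocation loss."""
    lm = sum(
        cross_entropy(logits, target)
        for logits, target in zip(mlm_logits, mlm_targets)
    ) / len(mlm_targets)
    return lm + geo_weight * geo_loss(pred_coords, gold_coords)
```

In a real setup both terms would be computed from a shared encoder, so that gradients from the geolocation head also shape the representations used for language modeling; that shared-encoder arrangement is what a multi-task setup generally implies.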