Paper Title
SERENGETI: Massively Multilingual Language Models for Africa
Paper Authors
Paper Abstract
Multilingual pretrained language models (mPLMs) acquire valuable, generalizable linguistic information during pretraining and have advanced the state of the art on task-specific finetuning. To date, only ~31 out of ~2,000 African languages are covered in existing language models. We ameliorate this limitation by developing SERENGETI, a massively multilingual language model that covers 517 African languages and language varieties. We evaluate our novel models on eight natural language understanding tasks across 20 datasets, comparing with 4 mPLMs that cover 4-23 African languages. SERENGETI outperforms the other models on 11 datasets across the eight tasks, achieving 82.27 average F_1. We also perform error analyses of our models, which allow us to investigate the influence of language genealogy and linguistic similarity when the models are applied under zero-shot settings. We will publicly release our models for research at https://github.com/UBC-NLP/serengeti.
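Since the abstract describes evaluating the released checkpoints on downstream NLU tasks and reporting average F_1, the following is a minimal sketch of how one might load such an encoder for a classification task and compute macro-averaged F1. The HuggingFace model ID "UBC-NLP/serengeti", the label count, and the example inputs are assumptions for illustration only; consult the GitHub repository above for the actual released checkpoint names.

```python
# Minimal sketch: evaluating an encoder checkpoint on a text-classification
# task and reporting macro-averaged F1, in the spirit of the paper's setup.
# "UBC-NLP/serengeti" is a hypothetical model ID; see the repo for the
# actual released checkpoints.
import torch
from sklearn.metrics import f1_score
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "UBC-NLP/serengeti"  # assumed checkpoint name
NUM_LABELS = 3                  # e.g., negative / neutral / positive

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID, num_labels=NUM_LABELS
)
model.eval()

# Toy evaluation batch; in practice this would be a finetuned model and a
# real test split from one of the 20 benchmark datasets.
texts = ["Ninafurahi sana leo.", "Sijui la kufanya."]  # illustrative inputs
gold = [2, 1]                                          # illustrative labels

inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    preds = model(**inputs).logits.argmax(dim=-1).tolist()

print("macro F1:", f1_score(gold, preds, average="macro"))
```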