Paper Title

Pre-training Polish Transformer-based Language Models at Scale

Paper Authors

Sławomir Dadas, Michał Perełkiewicz, Rafał Poświata

Paper Abstract

Transformer-based language models are now widely used in Natural Language Processing (NLP). This is especially true for the English language, for which many pre-trained models utilizing transformer-based architectures have been published in recent years. This has driven forward the state of the art for a variety of standard NLP tasks such as classification, regression, and sequence labeling, as well as text-to-text tasks such as machine translation, question answering, or summarization. The situation has been different for low-resource languages such as Polish, however. Although some transformer-based language models for Polish are available, none of them comes close to the largest English-language models in terms of corpus size and number of parameters. In this study, we present two language models for Polish based on the popular BERT architecture. The larger model was trained on a dataset consisting of over 1 billion Polish sentences, or 135GB of raw text. We describe our methodology for collecting the data, preparing the corpus, and pre-training the model. We then evaluate our models on thirteen Polish linguistic tasks, and demonstrate improvements over previous approaches in eleven of them.
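
Since the models described here are BERT-style masked language models, a quick way to probe such a checkpoint is masked-token prediction. The sketch below uses the Hugging Face `transformers` fill-mask pipeline; the checkpoint identifier is a placeholder assumption rather than a name given in the abstract, so substitute the actual released Polish model.

```python
# Minimal sketch: querying a BERT-style Polish masked language model via the
# Hugging Face `transformers` fill-mask pipeline.
# NOTE: the checkpoint name below is a placeholder assumption, not taken from
# the paper; replace it with the identifier of the released Polish model.
from transformers import pipeline

MODEL_ID = "path/to/polish-bert-checkpoint"  # hypothetical identifier

fill_mask = pipeline("fill-mask", model=MODEL_ID)

# Use the tokenizer's own mask token so the snippet works for both
# BERT-style ([MASK]) and RoBERTa-style (<mask>) vocabularies.
mask = fill_mask.tokenizer.mask_token

# "Warszawa jest stolicą <mask>." = "Warsaw is the capital of <mask>."
for prediction in fill_mask(f"Warszawa jest stolicą {mask}."):
    print(f"{prediction['token_str']:>15}  score={prediction['score']:.3f}")
```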
