Paper Title

Pre-training Polish Transformer-based Language Models at Scale

Paper Authors

Sławomir Dadas, Michał Perełkiewicz, Rafał Poświata

Paper Abstract

Transformer-based language models are now widely used in Natural Language Processing (NLP). This is especially true for the English language, for which many pre-trained models utilizing transformer-based architectures have been published in recent years. This has driven forward the state of the art for a variety of standard NLP tasks such as classification, regression, and sequence labeling, as well as text-to-text tasks such as machine translation, question answering, or summarization. The situation has been different for low-resource languages such as Polish, however. Although some transformer-based language models for Polish are available, none of them comes close to the largest English-language models in terms of corpus size and number of parameters. In this study, we present two language models for Polish based on the popular BERT architecture. The larger model was trained on a dataset consisting of over 1 billion Polish sentences, or 135GB of raw text. We describe our methodology for collecting the data, preparing the corpus, and pre-training the model. We then evaluate our models on thirteen Polish linguistic tasks, and demonstrate improvements over previous approaches in eleven of them.
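
Since the models described here are BERT-style masked language models, a quick way to probe such a checkpoint is masked-token prediction. The sketch below uses the Hugging Face `transformers` fill-mask pipeline; the checkpoint identifier is a placeholder assumption rather than a name given in the abstract, so substitute the actual released Polish model.

```python
# Minimal sketch: querying a BERT-style Polish masked language model via the
# Hugging Face `transformers` fill-mask pipeline.
# NOTE: the checkpoint name below is a placeholder assumption, not taken from
# the paper; replace it with the identifier of the released Polish model.
from transformers import pipeline

MODEL_ID = "path/to/polish-bert-checkpoint"  # hypothetical identifier

fill_mask = pipeline("fill-mask", model=MODEL_ID)

# Use the tokenizer's own mask token so the snippet works for both
# BERT-style ([MASK]) and RoBERTa-style (<mask>) vocabularies.
mask = fill_mask.tokenizer.mask_token

# "Warszawa jest stolicą <mask>." = "Warsaw is the capital of <mask>."
for prediction in fill_mask(f"Warszawa jest stolicą {mask}."):
    print(f"{prediction['token_str']:>15}  score={prediction['score']:.3f}")
```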
