Paper Title
Give your Text Representation Models some Love: the Case for Basque
Paper Authors
Paper Abstract
Word embeddings and pre-trained language models make it possible to build rich representations of text and have enabled improvements across most NLP tasks. Unfortunately, they are very expensive to train, and many small companies and research groups tend to use models that have been pre-trained and made available by third parties, rather than building their own. This is suboptimal as, for many languages, the models have been trained on smaller (or lower-quality) corpora. In addition, monolingual pre-trained models for non-English languages are not always available. At best, models for those languages are included in multilingual versions, where each language shares the quota of substrings and parameters with the rest of the languages. This is particularly true for smaller languages such as Basque. In this paper we show that a number of monolingual models (FastText word embeddings, FLAIR and BERT language models) trained with larger Basque corpora produce much better results than publicly available versions in downstream NLP tasks, including topic classification, sentiment classification, PoS tagging and NER. This work sets a new state-of-the-art in those tasks for Basque. All benchmarks and models used in this work are publicly available.