Paper Title
CLUECorpus2020: A Large-scale Chinese Corpus for Pre-training Language Model
Paper Authors
Paper Abstract
In this paper, we introduce CLUECorpus2020, a Chinese corpus from the CLUE organization: a large-scale corpus that can be used directly for self-supervised learning, such as pre-training of a language model or language generation. It contains 100 GB of raw text with 35 billion Chinese characters retrieved from Common Crawl. To better understand this corpus, we conduct language understanding experiments at both small and large scales, and the results show that models trained on this corpus achieve excellent performance on Chinese tasks. We release a new Chinese vocabulary with a size of 8K, only one-third of the vocabulary size used in the Chinese BERT released by Google. It saves computational cost and memory while working as well as the original vocabulary. We also release both large and tiny versions of the model pre-trained on this corpus. The former achieves state-of-the-art results, and the latter retains most of the accuracy while speeding up training and prediction by a factor of eight compared to BERT-base. To facilitate future work on self-supervised learning for Chinese, we release our dataset, new vocabulary, code, and pre-trained models on GitHub.
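As a hedged illustration (not taken from the paper), the sketch below shows how a released 8K WordPiece vocabulary could be plugged into a standard BERT-style tokenizer via Hugging Face `transformers`. The file name `vocab_clue_8k.txt` is a hypothetical placeholder for whatever file the CLUE GitHub release actually provides.

```python
# Minimal sketch, assuming the released 8K vocabulary is a plain
# WordPiece vocab file (one token per line), as used by BERT tokenizers.
# The file name below is a hypothetical placeholder, not from the paper.
from transformers import BertTokenizer

# Load the (assumed) CLUE 8K vocabulary in place of Google's larger
# Chinese BERT vocabulary, reducing embedding-table size and memory.
tokenizer = BertTokenizer(vocab_file="vocab_clue_8k.txt")

text = "今天天气不错"
tokens = tokenizer.tokenize(text)            # character-level WordPiece pieces
ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens, ids)
```

With a smaller vocabulary, the token embedding matrix shrinks proportionally, which is the main source of the computational and memory savings the abstract refers to.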