Paper title
esCorpius: A Massive Spanish Crawling Corpus
Paper authors
Paper abstract
In recent years, transformer-based models have led to significant advances in language modelling for natural language processing. However, they require a vast amount of data to be (pre-)trained, and there is a lack of corpora in languages other than English. Recently, several initiatives have presented multilingual datasets obtained from automatic web crawling. However, the results in Spanish have important shortcomings: they are either too small in comparison with other languages or of low quality owing to sub-optimal cleaning and deduplication. In this paper, we introduce esCorpius, a Spanish crawling corpus obtained from nearly 1 PB of Common Crawl data. It is the most extensive corpus in Spanish with this level of quality in the extraction, purification and deduplication of web textual content. Our data curation process involves a novel, highly parallel cleaning pipeline and encompasses a series of deduplication mechanisms that together ensure the integrity of both document and paragraph boundaries. Additionally, we maintain both the source web page URL and the WARC shard origin URL in order to comply with EU regulations. esCorpius has been released under the CC BY-NC-ND 4.0 license and is available on HuggingFace.