Paper Title


MultiCoNER: A Large-scale Multilingual dataset for Complex Named Entity Recognition

Authors

Shervin Malmasi, Anjie Fang, Besnik Fetahu, Sudipta Kar, Oleg Rokhlenko

Abstract


We present MultiCoNER, a large multilingual dataset for Named Entity Recognition that covers 3 domains (Wiki sentences, questions, and search queries) across 11 languages, as well as multilingual and code-mixed subsets. This dataset is designed to represent contemporary challenges in NER, including low-context scenarios (short and uncased text), syntactically complex entities like movie titles, and long-tail entity distributions. The 26M token dataset is compiled from public resources using techniques such as heuristic-based sentence sampling, template extraction and slotting, and machine translation. We applied two NER models to our dataset: a baseline XLM-RoBERTa model, and a state-of-the-art GEMNET model that leverages gazetteers. The baseline achieves moderate performance (macro-F1=54%), highlighting the difficulty of our data. GEMNET, which uses gazetteers, improves performance significantly (average macro-F1 improvement of +30%). MultiCoNER poses challenges even for large pre-trained language models, and we believe that it can help further research in building robust NER systems. MultiCoNER is publicly available at https://registry.opendata.aws/multiconer/ and we hope that this resource will help advance research in various aspects of NER.
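The macro-F1 figures quoted above average the F1 score across entity types, so rare long-tail classes weigh as much as frequent ones. A minimal sketch of span-level macro-F1 (the function name and the (type, start, end) span encoding are illustrative assumptions, not taken from the paper's evaluation code):

```python
from collections import defaultdict

def macro_f1(gold, pred):
    """Span-level macro-F1 over entity types.

    gold, pred: lists (one per sentence) of sets of
    (entity_type, start, end) tuples. Encoding is illustrative.
    """
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    types = set()
    for g, p in zip(gold, pred):
        for ent in p:  # predicted span: true positive iff exact match in gold
            types.add(ent[0])
            if ent in g:
                tp[ent[0]] += 1
            else:
                fp[ent[0]] += 1
        for ent in g:  # gold span missed by the prediction
            types.add(ent[0])
            if ent not in p:
                fn[ent[0]] += 1
    f1s = []
    for t in types:  # per-type F1, then an unweighted average
        prec = tp[t] / (tp[t] + fp[t]) if tp[t] + fp[t] else 0.0
        rec = tp[t] / (tp[t] + fn[t]) if tp[t] + fn[t] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s) if f1s else 0.0
```

For example, a prediction that finds the PER span but misses the LOC span in one sentence scores per-type F1 of 1.0 and 0.0, giving macro-F1 = 0.5.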
