Paper Title


MultiCoNER: A Large-scale Multilingual dataset for Complex Named Entity Recognition

Authors

Shervin Malmasi, Anjie Fang, Besnik Fetahu, Sudipta Kar, Oleg Rokhlenko

Abstract


We present MultiCoNER, a large multilingual dataset for Named Entity Recognition that covers 3 domains (Wiki sentences, questions, and search queries) across 11 languages, as well as multilingual and code-mixed subsets. This dataset is designed to represent contemporary challenges in NER, including low-context scenarios (short and uncased text), syntactically complex entities like movie titles, and long-tail entity distributions. The 26M token dataset is compiled from public resources using techniques such as heuristic-based sentence sampling, template extraction and slotting, and machine translation. We applied two NER models to our dataset: a baseline XLM-RoBERTa model, and a state-of-the-art GEMNET model that leverages gazetteers. The baseline achieves moderate performance (macro-F1=54%), highlighting the difficulty of our data. GEMNET, which uses gazetteers, improves performance significantly (average macro-F1 improvement of +30%). MultiCoNER poses challenges even for large pre-trained language models, and we believe that it can help further research in building robust NER systems. MultiCoNER is publicly available at https://registry.opendata.aws/multiconer/ and we hope that this resource will help advance research in various aspects of NER.
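The macro-F1 figures quoted above average the F1 score across entity types, so rare long-tail classes weigh as much as frequent ones. A minimal sketch of span-level macro-F1 (the function name and the (type, start, end) span encoding are illustrative assumptions, not taken from the paper's evaluation code):

```python
from collections import defaultdict

def macro_f1(gold, pred):
    """Span-level macro-F1 over entity types.

    gold, pred: lists (one per sentence) of sets of
    (entity_type, start, end) tuples. Encoding is illustrative.
    """
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    types = set()
    for g, p in zip(gold, pred):
        for ent in p:  # predicted span: true positive iff exact match in gold
            types.add(ent[0])
            if ent in g:
                tp[ent[0]] += 1
            else:
                fp[ent[0]] += 1
        for ent in g:  # gold span missed by the prediction
            types.add(ent[0])
            if ent not in p:
                fn[ent[0]] += 1
    f1s = []
    for t in types:  # per-type F1, then an unweighted average
        prec = tp[t] / (tp[t] + fp[t]) if tp[t] + fp[t] else 0.0
        rec = tp[t] / (tp[t] + fn[t]) if tp[t] + fn[t] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s) if f1s else 0.0
```

For example, a prediction that finds the PER span but misses the LOC span in one sentence scores per-type F1 of 1.0 and 0.0, giving macro-F1 = 0.5.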
