Paper Title

MINION: a Large-Scale and Diverse Dataset for Multilingual Event Detection

Authors

Amir Pouran Ben Veyseh, Minh Van Nguyen, Franck Dernoncourt, Thien Huu Nguyen

Abstract

Event Detection (ED) is the task of identifying and classifying trigger words of event mentions in text. Despite considerable research efforts in recent years for English text, the task of ED in other languages has been significantly less explored. Switching to non-English languages, important research questions for ED include how well existing ED models perform on different languages, how challenging ED is in other languages, and how well ED knowledge and annotation can be transferred across languages. To answer those questions, it is crucial to obtain multilingual ED datasets that provide consistent event annotation for multiple languages. There exist some multilingual ED datasets; however, they tend to cover a handful of languages and mainly focus on popular ones. Many languages are not covered in existing multilingual ED datasets. In addition, the current datasets are often small and not accessible to the public. To overcome those shortcomings, we introduce a new large-scale multilingual dataset for ED (called MINION) that consistently annotates events for 8 different languages; 5 of them have not been supported by existing multilingual datasets. We also perform extensive experiments and analysis to demonstrate the challenges and transferability of ED across languages in MINION that in all call for more research effort in this area.
