Paper Title
ParsiNLU: A Suite of Language Understanding Challenges for Persian
Paper Authors
Paper Abstract
Despite the progress made in recent years on natural language understanding (NLU) challenges, most of this progress remains concentrated on resource-rich languages such as English. This work focuses on Persian, one of the world's widely spoken languages, for which few NLU datasets are nevertheless available. High-quality evaluation datasets are necessary for reliably assessing progress on different NLU tasks and domains. We introduce ParsiNLU, the first benchmark in Persian that includes a range of high-level tasks, such as reading comprehension and textual entailment. These datasets are collected in a multitude of ways, often involving manual annotation by native speakers. This results in over 14.5$k$ new instances across 6 distinct NLU tasks. In addition, we present the first results of state-of-the-art monolingual and multilingual pre-trained language models on this benchmark and compare them with human performance, which provides valuable insights into our ability to tackle natural language understanding challenges in Persian. We hope ParsiNLU fosters further research and advances in Persian language understanding.