Paper Title

Mono vs Multilingual BERT: A Case Study in Hindi and Marathi Named Entity Recognition

Paper Authors

Onkar Litake, Maithili Sabane, Parth Patil, Aparna Ranade, Raviraj Joshi

Paper Abstract

Named entity recognition (NER) is the process of recognising and classifying important information (entities) in text. Proper nouns, such as a person's name, an organization's name, or a location's name, are examples of entities. NER is an important module in applications like human resources, customer support, search engines, content classification, and academia. In this work, we consider NER for low-resource Indian languages like Hindi and Marathi. Transformer-based models have been widely used for NER tasks. We consider different variations of BERT, such as base-BERT, RoBERTa, and ALBERT, and benchmark them on publicly available Hindi and Marathi NER datasets. We provide an exhaustive comparison of different monolingual and multilingual transformer-based models and establish simple baselines currently missing in the literature. We show that the monolingual MahaRoBERTa model performs best for Marathi NER, whereas the multilingual XLM-RoBERTa performs best for Hindi NER. We also perform cross-language evaluation and present mixed observations.
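For concreteness, below is a minimal sketch (not the authors' code) of how such a benchmark is typically set up with the Hugging Face Transformers library: a BERT-family checkpoint is loaded with a token-classification head and run over a sentence. The model name, label set, and example sentence are illustrative assumptions; the multilingual "xlm-roberta-base" checkpoint shown here could be swapped for a monolingual one such as MahaRoBERTa.

```python
# Minimal NER sketch with Hugging Face Transformers (illustrative, not the
# authors' code). Model name, label set, and sentence are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "xlm-roberta-base"  # assumed stand-in for one benchmarked variant
# Illustrative BIO tag set for person/organization/location entities.
LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME, num_labels=len(LABELS)
)

# Tag an example Hindi sentence.
sentence = "नरेंद्र मोदी दिल्ली में रहते हैं"
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, num_labels)

# Pick the highest-scoring label per subword token and print the pairs.
predictions = logits.argmax(dim=-1).squeeze(0).tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for token, pred in zip(tokens, predictions):
    print(f"{token}\t{LABELS[pred]}")
```

Note that the classification head above is randomly initialized; in the paper's setting the model would first be fine-tuned on the Hindi or Marathi NER training split before any evaluation.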
