A hybrid entity-centric approach to Persian pronoun resolution

Paper title

A hybrid entity-centric approach to Persian pronoun resolution

Authors

Mohammadi, Hassan Haji; Talebpour, Alireza; Aznaveh, Ahmad Mahmoudi; Yazdani, Samaneh

Abstract

Pronoun resolution is a challenging subset of coreference resolution, an essential field in natural language processing. Coreference resolution is the task of finding all expressions in a text that refer to the same real-world entity. This paper presents a hybrid model that combines multiple rule-based sieves with a machine-learning sieve for pronouns. For this purpose, seven high-precision rule-based sieves are designed for the Persian language. A random forest classifier then links pronouns to the previously built partial clusters. The presented method demonstrates exemplary performance by using a pipeline design and combining the advantages of machine-learning and rule-based methods. It also addresses some of the challenges faced by end-to-end models. In addition, the authors develop a Persian coreference corpus called Mehr, consisting of 400 documents. This corpus fixes some weaknesses of the previous Persian corpora. Finally, the proposed method is evaluated on the Mehr and Uppsala test sets, and its efficiency is reported in comparison with the earlier model for Persian.
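
The pipeline described in the abstract (rule-based sieves first, then a learned step that attaches pronouns to the partial clusters) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `Mention` and `Cluster` types, the single `exact_match_sieve` rule, and the `recency_scorer`, which is only a stand-in for the trained random forest classifier. None of these names, rules, or features come from the paper, and the toy input is English rather than Persian.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Mention:
    text: str
    position: int            # token offset in the document
    is_pronoun: bool = False


@dataclass
class Cluster:
    mentions: List[Mention] = field(default_factory=list)


# A sieve looks at one mention and the partial clusters built so far,
# and returns the cluster to join, or None if the rule does not fire.
Sieve = Callable[[Mention, List[Cluster]], Optional[Cluster]]


def exact_match_sieve(mention: Mention, clusters: List[Cluster]) -> Optional[Cluster]:
    """Hypothetical high-precision rule: link identical mention strings."""
    for cluster in clusters:
        if any(m.text == mention.text for m in cluster.mentions):
            return cluster
    return None


def recency_scorer(pronoun: Mention, cluster: Cluster) -> float:
    """Stand-in for a trained classifier: prefer the closest preceding mention."""
    nearest = max(m.position for m in cluster.mentions if m.position < pronoun.position)
    return 1.0 / (pronoun.position - nearest)


def resolve(mentions: List[Mention], rule_sieves: List[Sieve],
            scorer: Callable[[Mention, Cluster], float]) -> List[Cluster]:
    ordered = sorted(mentions, key=lambda m: m.position)
    clusters: List[Cluster] = []

    # Pass 1: rule-based sieves build partial clusters from non-pronoun mentions.
    for mention in (m for m in ordered if not m.is_pronoun):
        target = None
        for sieve in rule_sieves:
            target = sieve(mention, clusters)
            if target is not None:
                break
        if target is None:
            target = Cluster()
            clusters.append(target)
        target.mentions.append(mention)

    # Pass 2: each pronoun is linked to the best-scoring cluster that
    # already contains a mention occurring before the pronoun.
    for pronoun in (m for m in ordered if m.is_pronoun):
        candidates = [c for c in clusters
                      if any(m.position < pronoun.position for m in c.mentions)]
        if candidates:
            best = max(candidates, key=lambda c: scorer(pronoun, c))
        else:
            best = Cluster()
            clusters.append(best)
        best.mentions.append(pronoun)

    return clusters


# Toy usage with made-up mentions.
toy = [
    Mention("Sara", 0),
    Mention("Ali", 3),
    Mention("he", 5, is_pronoun=True),
    Mention("Sara", 8),
]
clusters = resolve(toy, [exact_match_sieve], recency_scorer)
cluster_texts = [[m.text for m in c.mentions] for c in clusters]
```

In this sketch the scorer is a plain function, so a trained model could be swapped in by wrapping its prediction call in a function with the same signature. Running the rule-based pass before the pronoun pass mirrors the ordering stated in the abstract, where pronouns are attached to clusters that the sieves have already formed.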
