Automatic Creation of Named Entity Recognition Datasets by Querying Phrase Representations

Paper Title

Automatic Creation of Named Entity Recognition Datasets by Querying Phrase Representations

Authors

Kim, Hyunjae; Yoo, Jaehyo; Yoon, Seunghyun; Kang, Jaewoo

Abstract

Most weakly supervised named entity recognition (NER) models rely on domain-specific dictionaries provided by experts. This approach is infeasible in many domains where dictionaries do not exist. While a phrase retrieval model was used to construct pseudo-dictionaries with entities retrieved from Wikipedia automatically in a recent study, these dictionaries often have limited coverage because the retriever is likely to retrieve popular entities rather than rare ones. In this study, we present a novel framework, HighGEN, that generates NER datasets with high-coverage pseudo-dictionaries. Specifically, we create entity-rich dictionaries with a novel search method, called phrase embedding search, which encourages the retriever to search a space densely populated with various entities. In addition, we use a new verification process based on the embedding distance between candidate entity mentions and entity types to reduce the false-positive noise in weak labels generated by high-coverage dictionaries. We demonstrate that HighGEN outperforms the previous best model by an average F1 score of 4.7 across five NER benchmark datasets.
