Named entity recognition in chemical patents using ensemble of contextual language models

Title

Named entity recognition in chemical patents using ensemble of contextual language models

Authors

Jenny Copara, Nona Naderi, Julien Knafou, Patrick Ruch, Douglas Teodoro

Abstract

Chemical patent documents describe a broad range of applications holding key reaction and compound information, such as chemical structures, reaction formulas, and molecular properties. These informational entities must first be identified in text passages before they can be used in downstream tasks. Text mining provides the means to extract relevant information from chemical patents through information extraction techniques. As part of the Information Extraction task of the Cheminformatics Elsevier Melbourne University challenge, in this work we study the effectiveness of contextualized language models for extracting reaction information from chemical patents. We assess transformer architectures trained on generic and specialised corpora to propose a new ensemble model. Our best model, based on a majority ensemble approach, achieves an exact F1-score of 92.30% and a relaxed F1-score of 96.24%. The results show that an ensemble of contextualized language models can provide an effective method to extract information from chemical patents.
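
The abstract names a majority ensemble approach but does not give its details. As a rough illustration only, majority voting over per-token labels from several taggers could be sketched as below. The function name `majority_vote`, the BIO label strings, and the tie-breaking rule (prefer the label from the earliest model in the list) are all assumptions made for this sketch and are not taken from the paper.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-token labels from several models by majority vote.

    predictions: list of label sequences, one per model, all covering
    the same tokens. Hypothetical helper, not the authors' code.
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all models must label the same tokens")

    voted = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        top = max(counts.values())
        # Assumed tie-break: keep the label proposed by the earliest model
        # among those that reached the highest vote count.
        winner = next(label for label in labels if counts[label] == top)
        voted.append(winner)
    return voted


# Illustrative labels only; the real tag set is defined by the challenge.
model_a = ["B-CHEM", "I-CHEM", "O"]
model_b = ["B-CHEM", "O", "O"]
model_c = ["B-CHEM", "I-CHEM", "B-CHEM"]
ensemble_labels = majority_vote([model_a, model_b, model_c])
print(ensemble_labels)
```

Voting per token is the simplest variant; voting on whole entity spans is another option, and the abstract does not say which one the authors used.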
