Topic Aware Contextualized Embeddings for High Quality Phrase Extraction

Paper Title

Topic Aware Contextualized Embeddings for High Quality Phrase Extraction

Authors

V, Venktesh; Mohania, Mukesh; Goyal, Vikram

Abstract

Keyphrase extraction from a given document is the task of automatically extracting salient phrases that best describe the document. This paper proposes a novel unsupervised graph-based ranking method to extract high-quality phrases from a given document. We obtain the contextualized embeddings from pre-trained language models enriched with topic vectors from Latent Dirichlet Allocation (LDA) to represent the candidate phrases and the document. We introduce a scoring mechanism for the phrases using the information obtained from contextualized embeddings and the topic vectors. The salient phrases are extracted using a ranking algorithm on an undirected graph constructed for the given document. In the undirected graph, the nodes represent the phrases, and the edges between the phrases represent the semantic relatedness between them, weighted by a score obtained from the scoring mechanism. To demonstrate the efficacy of our proposed method, we perform several experiments on open source datasets in the science domain and observe that our novel method outperforms existing unsupervised embedding based keyphrase extraction methods. For instance, on the SemEval2017 dataset, our method advances the F1 score from 0.2195 (EmbedRank) to 0.2819 at the top 10 extracted keyphrases. Several variants of the proposed algorithm are investigated to determine their effect on the quality of keyphrases. We further demonstrate the ability of our proposed method to collect additional high-quality keyphrases that are not present in the document from external knowledge bases like Wikipedia for enriching the document with newly discovered keyphrases. We evaluate this step on a collection of annotated documents. The F1-score at the top 10 expanded keyphrases is 0.60, indicating that our algorithm can also be used for 'concept' expansion using external knowledge.
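The graph-ranking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each candidate phrase is already represented by one vector (for example, a contextualized embedding concatenated with an LDA topic vector), uses cosine similarity as the edge weight, and uses weighted PageRank as the ranking algorithm; the abstract names none of these choices explicitly. The function names `cosine` and `rank_phrases` and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_phrases(phrase_vectors, damping=0.85, iterations=50):
    """Rank phrases with weighted PageRank on an undirected similarity graph.

    phrase_vectors maps each candidate phrase to its vector. Assumption: the
    vector already combines the embedding and the topic information.
    """
    phrases = list(phrase_vectors)
    n = len(phrases)
    if n == 0:
        return []

    # Symmetric edge weights; negative similarities are clipped to zero.
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w = max(0.0, cosine(phrase_vectors[phrases[i]], phrase_vectors[phrases[j]]))
            weights[i][j] = weights[j][i] = w

    # Total edge weight attached to each node, used to normalize outgoing score.
    strength = [sum(row) for row in weights]

    scores = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            incoming = sum(
                weights[j][i] * scores[j] / strength[j]
                for j in range(n)
                if strength[j] > 0
            )
            updated.append((1 - damping) / n + damping * incoming)
        scores = updated

    return sorted(zip(phrases, scores), key=lambda pair: pair[1], reverse=True)


# Toy vectors (made up for illustration only).
vectors = {
    "keyphrase extraction": [0.9, 0.1, 0.8, 0.2],
    "phrase ranking": [0.8, 0.2, 0.7, 0.3],
    "topic model": [0.7, 0.3, 0.6, 0.4],
    "coffee break": [0.1, 0.9, 0.1, 0.9],
}
ranked = rank_phrases(vectors)
top_k = [phrase for phrase, _ in ranked[:3]]
```

In this toy input, the phrase whose vector points away from the others receives the lowest score, because it contributes and receives the least edge weight. A real pipeline would replace the toy vectors with model-derived representations and would likely add the document-level scoring that the abstract mentions.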
