Paper Title
Capturing Global Informativeness in Open Domain Keyphrase Extraction
Paper Authors
Paper Abstract
Open-domain KeyPhrase Extraction (KPE) aims to extract keyphrases from documents without domain or quality restrictions, e.g., web pages with varying domains and quality levels. Recently, neural methods have shown promising results in many KPE tasks due to their powerful capacity for modeling the contextual semantics of the given documents. However, we empirically show that most neural KPE methods prefer to extract keyphrases with good phraseness, such as short and entity-style n-grams, instead of globally informative keyphrases from open-domain documents. This paper presents JointKPE, an open-domain KPE architecture built on pre-trained language models, which can capture both local phraseness and global informativeness when extracting keyphrases. JointKPE learns to rank keyphrases by estimating their informativeness in the entire document and is jointly trained on the keyphrase chunking task to guarantee the phraseness of keyphrase candidates. Experiments on two large KPE datasets with diverse domains, OpenKP and KP20k, demonstrate the effectiveness of JointKPE with different pre-trained variants in open-domain scenarios. Further analyses reveal the significant advantages of JointKPE in predicting long and non-entity keyphrases, which are challenging for previous neural KPE methods. Our code is publicly available at https://github.com/thunlp/BERT-KPE.
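For concreteness, below is a minimal, illustrative sketch (not the authors' released code; see the BERT-KPE repository for that) of the two training signals the abstract describes: an n-gram chunking loss that supervises phraseness and a pairwise ranking loss that supervises document-level informativeness. The class and function names (JointKPESketch, joint_loss), the per-length convolutional phrase encoder, and all tensor shapes are assumptions made for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointKPESketch(nn.Module):
    """Scores every n-gram of a document for joint chunking + ranking (illustrative)."""

    def __init__(self, hidden_size=768, max_ngram_len=5):
        super().__init__()
        # One convolution per n-gram length builds phrase representations
        # from contextual token embeddings (an assumption for this sketch).
        self.ngram_convs = nn.ModuleList(
            nn.Conv1d(hidden_size, hidden_size, kernel_size=n)
            for n in range(1, max_ngram_len + 1)
        )
        # A shared scorer: the same score is supervised by both the chunking
        # labels (phraseness) and the ranking objective (informativeness).
        self.scorer = nn.Linear(hidden_size, 1)

    def forward(self, token_embeds):
        # token_embeds: [batch, seq_len, hidden], e.g. the output of a BERT-style encoder,
        # with seq_len assumed to be at least max_ngram_len.
        x = token_embeds.transpose(1, 2)              # [batch, hidden, seq_len]
        scores = []
        for conv in self.ngram_convs:
            phrase_repr = conv(x).transpose(1, 2)     # [batch, seq_len - n + 1, hidden]
            scores.append(self.scorer(phrase_repr).squeeze(-1))
        return scores                                 # one score tensor per n-gram length


def joint_loss(ngram_scores, chunk_labels, pos_scores, neg_scores, margin=1.0):
    """Chunking loss (is this span a keyphrase?) plus a pairwise ranking loss
    (ground-truth keyphrases should outscore other candidates in the document)."""
    chunking = sum(
        F.binary_cross_entropy_with_logits(score, label.float())
        for score, label in zip(ngram_scores, chunk_labels)
    )
    ranking = F.margin_ranking_loss(
        pos_scores, neg_scores, torch.ones_like(pos_scores), margin=margin
    )
    return chunking + ranking
```

In line with the abstract's claim that informativeness is estimated over the entire document, a phrase that occurs multiple times would presumably have its occurrence-level scores aggregated (e.g., max-pooled) into a single document-level score before the ranking loss is applied; the exact aggregation here is an assumption of this sketch.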