Paper Title
Nominal Compound Chain Extraction: A New Task for Semantic-enriched Lexical Chain
Paper Authors
Paper Abstract
A lexical chain consists of the cohesive words in a document; it reflects the underlying structure of the text and thus facilitates downstream NLP tasks. Existing work, however, focuses on detecting simple surface words with shallow syntactic associations, ignoring semantic-aware lexical compounds as well as latent semantic frames (e.g., topics), which can be much more crucial for real-world NLP applications. In this paper, we introduce a novel task, Nominal Compound Chain Extraction (NCCE), which extracts and clusters all the nominal compounds that share the same semantic topic. We model the task as a two-stage prediction (compound extraction followed by chain detection), handled by a proposed joint framework. The model employs a BERT encoder to yield contextualized document representations, and HowNet is exploited as an external resource that offers rich sememe information. Experiments on our manually annotated corpus demonstrate the necessity of the NCCE task as well as the effectiveness of our joint approach.
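To make the two-stage formulation concrete, the following is a minimal sketch of the input and output shapes only, not the authors' model. It assumes compound extraction can be viewed as BIO tagging and chain detection as merging pairwise "same topic" links; the tags, the links, the example sentence, and the helper names `decode_bio` and `build_chains` are all hypothetical and hand-written here, whereas in the paper both stages are predicted by the joint BERT-based framework.

```python
from typing import Dict, List, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[int, int, str]]:
    """Stage 1 (toy): turn B/I/O tags into (start, end, text) compound spans."""
    spans: List[Tuple[int, int, str]] = []
    start = None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        if start is not None and tag != "I":
            spans.append((start, i, " ".join(tokens[start:i])))
            start = None
        if tag == "B":
            start = i
    return spans


def build_chains(n: int, links: List[Tuple[int, int]]) -> List[List[int]]:
    """Stage 2 (toy): merge pairwise 'same topic' links into chains (union-find)."""
    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)

    groups: Dict[int, List[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical example: tags and links are written by hand, not predicted.
tokens = "the stock market fell as interest rates rose and share prices dropped".split()
tags = ["O", "B", "I", "O", "O", "B", "I", "O", "O", "B", "I", "O"]
compounds = decode_bio(tokens, tags)
chains = build_chains(len(compounds), links=[(0, 2)])
for chain in chains:
    print([compounds[i][2] for i in chain])
```

Each printed list is one chain of compounds assumed to share a topic. Union-find is used here only because it turns pairwise decisions into consistent clusters; the abstract does not state how the proposed framework forms chains.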