Paper Title
Unsupervised Term Extraction for Highly Technical Domains
Paper Authors
Paper Abstract
Term extraction is an information extraction task at the root of knowledge discovery platforms. Developing term extractors that are able to generalize across very diverse and potentially highly technical domains is challenging, as annotations for domains requiring in-depth expertise are scarce and expensive to obtain. In this paper, we describe the term extraction subsystem of a commercial knowledge discovery platform that targets highly technical fields such as pharma, medical, and material science. To be able to generalize across domains, we introduce a fully unsupervised annotator (UA). It extracts terms by combining novel morphological signals from sub-word tokenization with term-to-topic and intra-term similarity metrics, computed using general-domain pre-trained sentence encoders. The annotator is used to implement a weakly-supervised setup, where transformer models are fine-tuned (or pre-trained) on the training data generated by running the UA over large unlabeled corpora. Our experiments demonstrate that our setup can improve predictive performance while decreasing inference latency on both CPUs and GPUs. Our annotators provide a very competitive baseline for all the cases where annotations are not available.
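The similarity metrics mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's actual implementation: the function names, the mean-pooled term embedding, the toy vectors, and the weighting parameter `alpha` are all assumptions for illustration. In practice, the embeddings would come from a pre-trained sentence encoder; here, cosine similarity between a candidate term's embedding and a topic embedding (term-to-topic), and average pairwise similarity among the term's word embeddings (intra-term), are combined into a single score.

```python
import math

def cosine(u, v):
    # Standard cosine similarity between two vectors; 0.0 for zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def term_score(word_vecs, topic_vec, alpha=0.5):
    # Hypothetical scoring sketch, not the paper's method.
    # Term embedding: mean of the word vectors (a common, simple choice).
    dim = len(topic_vec)
    n = len(word_vecs)
    term_vec = [sum(v[i] for v in word_vecs) / n for i in range(dim)]
    term_to_topic = cosine(term_vec, topic_vec)
    # Intra-term similarity: average pairwise similarity among the
    # term's words (1.0 by convention for single-word terms).
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    intra = (sum(cosine(word_vecs[i], word_vecs[j]) for i, j in pairs)
             / len(pairs)) if pairs else 1.0
    # Linear combination; alpha is an assumed hyperparameter.
    return alpha * term_to_topic + (1 - alpha) * intra

# With these toy vectors, a term aligned with the topic outranks an
# orthogonal one.
aligned = term_score([[1.0, 0.0]], [1.0, 0.0])
orthogonal = term_score([[0.0, 1.0]], [1.0, 0.0])
```

A threshold on such a score could then separate candidate terms from non-terms, which is conceptually how a similarity-based unsupervised annotator can label text without human annotations.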