Paper Title
UmlsBERT: Clinical Domain Knowledge Augmentation of Contextual Embeddings Using the Unified Medical Language System Metathesaurus
Paper Authors
Paper Abstract
Contextual word embedding models, such as BioBERT and Bio_ClinicalBERT, have achieved state-of-the-art results in biomedical natural language processing tasks by focusing their pre-training process on domain-specific corpora. However, such models do not take expert domain knowledge into consideration. In this work, we introduce UmlsBERT, a contextual embedding model that integrates domain knowledge during the pre-training process via a novel knowledge-augmentation strategy. More specifically, UmlsBERT is augmented with the Unified Medical Language System (UMLS) Metathesaurus in two ways: i) connecting words that share the same underlying 'concept' in UMLS, and ii) leveraging semantic-group knowledge in UMLS to create clinically meaningful input embeddings. By applying these two strategies, UmlsBERT can encode clinical domain knowledge into word embeddings and outperforms existing domain-specific models on common named-entity recognition (NER) and clinical natural language inference tasks.
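The second augmentation strategy above can be illustrated with a minimal sketch: each token's input embedding is the sum of its word embedding and a learned embedding for its UMLS semantic group, analogous to how BERT sums token, segment, and position embeddings. All sizes, the `token_to_group` mapping, and the random initialization below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, GROUPS, DIM = 100, 5, 8  # toy sizes; the real model uses BERT dimensions

# Learned lookup tables (randomly initialized here for illustration)
word_emb = rng.normal(size=(VOCAB, DIM))
group_emb = np.vstack([
    np.zeros(DIM),                        # group 0: token has no UMLS semantic group
    rng.normal(size=(GROUPS - 1, DIM)),   # one row per UMLS semantic group
])

# Hypothetical mapping from token id to semantic-group id (0 = none);
# e.g. tokens 7 and 13 might both map to the "Disorders" group.
token_to_group = {7: 2, 13: 2, 42: 4}

def input_embedding(token_id: int) -> np.ndarray:
    """Sum the word embedding with its semantic-group embedding."""
    group_id = token_to_group.get(token_id, 0)
    return word_emb[token_id] + group_emb[group_id]

# Tokens in the same semantic group share the group component,
# pulling clinically related words toward each other in input space.
v1, v2 = input_embedding(7), input_embedding(13)
```

Because tokens in the same semantic group share one additive component, their input embeddings differ only by their word-embedding difference, which is one way such a scheme injects clinical structure before any transformer layer is applied.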