Paper Title
Multi-domain Clinical Natural Language Processing with MedCAT: the Medical Concept Annotation Toolkit
Paper Authors
Paper Abstract
Electronic health records (EHR) contain large volumes of unstructured text, requiring the application of Information Extraction (IE) technologies to enable clinical analysis. We present the open-source Medical Concept Annotation Toolkit (MedCAT) that provides: a) a novel self-supervised machine learning algorithm for extracting concepts using any concept vocabulary including UMLS/SNOMED-CT; b) a feature-rich annotation interface for customising and training IE models; and c) integrations to the broader CogStack ecosystem for vendor-agnostic health system deployment. We show improved performance in extracting UMLS concepts from open datasets (F1: 0.448-0.738 vs 0.429-0.650). Further real-world validation demonstrates SNOMED-CT extraction at 3 large London hospitals with self-supervised training over ~8.8B words from ~17M clinical records and further fine-tuning with ~6K clinician-annotated examples. We show strong transferability (F1 > 0.94) between hospitals, datasets, and concept types indicating cross-domain EHR-agnostic utility for accelerated clinical and research use cases.
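For illustration only, the sketch below shows how concept extraction with the publicly released MedCAT Python package typically looks. It is not taken from this paper: the model-pack filename is a placeholder, a real UMLS/SNOMED-CT model pack must be obtained separately (licensing applies), and the printed entity fields reflect the public package's documented output format.

```python
# Minimal sketch: extracting linked medical concepts from free text with MedCAT.
from medcat.cat import CAT

# Load a pretrained model pack (placeholder path; obtain a real pack separately).
cat = CAT.load_model_pack("medcat_model_pack.zip")

text = "Patient presents with chest pain and a history of type 2 diabetes."
result = cat.get_entities(text)

# Each detected entity carries the linked concept identifier (CUI),
# a human-readable concept name, and the matched span of source text.
for ent in result["entities"].values():
    print(ent["cui"], ent["pretty_name"], ent["source_value"])
```

In practice, such a model pack would first be built by self-supervised training over a large EHR corpus and then fine-tuned on clinician-annotated examples, as described in the abstract above.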