Paper Title
Multi-domain Clinical Natural Language Processing with MedCAT: the Medical Concept Annotation Toolkit
Paper Authors
Paper Abstract
Electronic health records (EHR) contain large volumes of unstructured text, requiring the application of Information Extraction (IE) technologies to enable clinical analysis. We present the open-source Medical Concept Annotation Toolkit (MedCAT) that provides: a) a novel self-supervised machine learning algorithm for extracting concepts using any concept vocabulary including UMLS/SNOMED-CT; b) a feature-rich annotation interface for customising and training IE models; and c) integrations to the broader CogStack ecosystem for vendor-agnostic health system deployment. We show improved performance in extracting UMLS concepts from open datasets (F1: 0.448-0.738 vs 0.429-0.650). Further real-world validation demonstrates SNOMED-CT extraction at 3 large London hospitals with self-supervised training over ~8.8B words from ~17M clinical records and further fine-tuning with ~6K clinician-annotated examples. We show strong transferability (F1 > 0.94) between hospitals, datasets, and concept types indicating cross-domain EHR-agnostic utility for accelerated clinical and research use cases.
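For illustration only, the sketch below shows how concept extraction with the publicly released MedCAT Python package typically looks. It is not taken from this paper: the model-pack filename is a placeholder, a real UMLS/SNOMED-CT model pack must be obtained separately (licensing applies), and the printed entity fields reflect the public package's documented output format.

```python
# Minimal sketch: extracting linked medical concepts from free text with MedCAT.
from medcat.cat import CAT

# Load a pretrained model pack (placeholder path; obtain a real pack separately).
cat = CAT.load_model_pack("medcat_model_pack.zip")

text = "Patient presents with chest pain and a history of type 2 diabetes."
result = cat.get_entities(text)

# Each detected entity carries the linked concept identifier (CUI),
# a human-readable concept name, and the matched span of source text.
for ent in result["entities"].values():
    print(ent["cui"], ent["pretty_name"], ent["source_value"])
```

In practice, such a model pack would first be built by self-supervised training over a large EHR corpus and then fine-tuned on clinician-annotated examples, as described in the abstract above.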