Paper Title


Knowledge Injected Prompt Based Fine-tuning for Multi-label Few-shot ICD Coding

Authors

Zhichao Yang, Shufan Wang, Bhanu Pratap Singh Rawat, Avijit Mitra, Hong Yu

Abstract


Automatic International Classification of Diseases (ICD) coding aims to assign multiple ICD codes to a medical note with an average length of 3,000+ tokens. This task is challenging due to the high-dimensional space of multi-label assignment (tens of thousands of ICD codes) and the long-tail challenge: only a few codes (common diseases) are frequently assigned, while most codes (rare diseases) are infrequently assigned. This study addresses the long-tail challenge by adapting a prompt-based fine-tuning technique with label semantics, which has been shown to be effective under few-shot settings. To further enhance performance in the medical domain, we propose a knowledge-enhanced Longformer by injecting three types of domain-specific knowledge: hierarchy, synonyms, and abbreviations, with additional pretraining using contrastive learning. Experiments on MIMIC-III-full, a benchmark dataset for code assignment, show that our proposed method outperforms the previous state-of-the-art method by 14.5% in macro F1 (from 10.3 to 11.8, P<0.001). To further test our model in the few-shot setting, we created a new rare-disease coding dataset, MIMIC-III-rare50, on which our model improves macro F1 from 17.1 to 30.4 and micro F1 from 17.2 to 32.6 compared to the previous method.
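The abstract mentions additional pretraining with contrastive learning over injected knowledge such as ICD code synonyms. A common way to realize this is an InfoNCE-style objective that pulls an anchor embedding (e.g. a code description) toward a positive (e.g. a synonym of the same code) and pushes it away from negatives (descriptions of other codes). The sketch below is a minimal, generic illustration with plain NumPy vectors standing in for encoder outputs; the function names, vector inputs, and temperature value are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def info_nce(anchor, positive, negatives, tau=0.07):
    """InfoNCE-style contrastive loss (illustrative sketch).

    anchor:    embedding of a code description
    positive:  embedding of a synonym/abbreviation of the same code
    negatives: embeddings of other codes' descriptions
    tau:       temperature (0.07 is a conventional default, assumed here)
    """
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    logits = np.array(sims) / tau
    logits -= logits.max()  # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    # Cross-entropy with the positive pair at index 0:
    return -np.log(probs[0])

# Toy usage: an anchor aligned with its positive and orthogonal to the
# negative yields a near-zero loss; the reverse yields a large loss.
a = np.array([1.0, 0.0])
low = info_nce(a, np.array([1.0, 0.0]), [np.array([0.0, 1.0])])
high = info_nce(a, np.array([0.0, 1.0]), [np.array([1.0, 0.0])])
```

In practice the anchors and positives would come from the Longformer encoder, and the loss would be averaged over in-batch negatives rather than an explicit negative list.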
