Paper Title

Medical Coding with Biomedical Transformer Ensembles and Zero/Few-shot Learning

Paper Authors

Ziletti, Angelo; Akbik, Alan; Berns, Christoph; Herold, Thomas; Legler, Marion; Viell, Martina

Paper Abstract

Medical coding (MC) is an essential prerequisite for reliable data retrieval and reporting. Given a free-text reported term (RT) such as "pain of right thigh to the knee", the task is to identify the matching lowest-level term (LLT) - in this case "unilateral leg pain" - from a very large and continuously growing repository of standardized medical terms. However, automating this task is challenging due to the large number of LLT codes (over 80,000 as of writing), the limited availability of training data for long-tail/emerging classes, and the generally high accuracy demands of the medical domain. With this paper, we introduce the MC task, discuss its challenges, and present a novel approach called xTARS that combines traditional BERT-based classification with a recent zero/few-shot learning approach (TARS). We present extensive experiments showing that our combined approach outperforms strong baselines, especially in the few-shot regime. The approach has been developed and deployed at Bayer and has been live since November 2021. As we believe our approach is potentially promising beyond MC, and to ensure reproducibility, we release the code to the research community.
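
To make the zero/few-shot component of the abstract concrete, below is a minimal sketch of how the MC task can be framed as zero-shot classification with the flair library's TARSClassifier (an implementation of the TARS approach the paper builds on). This is not the authors' released xTARS code; the 'tars-base' checkpoint and the three candidate LLTs are illustrative assumptions, and a real system would need to score candidates drawn from the full inventory of more than 80,000 standardized terms.

    from flair.data import Sentence
    from flair.models import TARSClassifier

    # Load a pretrained TARS model (the standard English checkpoint
    # shipped with flair; used here purely for illustration).
    tars = TARSClassifier.load('tars-base')

    # Free-text reported term (RT) from the abstract's example.
    sentence = Sentence("pain of right thigh to the knee")

    # A tiny, illustrative subset of candidate LLTs; the real inventory
    # contains over 80,000 standardized medical terms.
    candidate_llts = ["unilateral leg pain", "knee arthralgia", "back pain"]

    # Zero-shot prediction: TARS pairs each candidate label with the
    # input text and scores how well they match.
    tars.predict_zero_shot(sentence, candidate_llts)

    print(sentence.labels)  # e.g. "unilateral leg pain" with a confidence score

In xTARS as described in the abstract, such TARS predictions are combined with a conventional BERT-based classifier ensemble, which is what drives the reported gains in the few-shot regime.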
