Paper Title

How Long Is Enough? Exploring the Optimal Intervals of Long-Range Clinical Note Language Modeling

Paper Authors

Samuel Cahyawijaya, Bryan Wilie, Holy Lovenia, Huan Zhong, MingQian Zhong, Yuk-Yu Nancy Ip, Pascale Fung

Paper Abstract

Large pre-trained language models (LMs) have been widely adopted in the biomedical and clinical domains, introducing many powerful LMs such as bio-lm and BioELECTRA. However, the applicability of these methods to real clinical use cases is hindered by the limitation of pre-trained LMs in processing long textual data with thousands of words, a common length for clinical notes. In this work, we explore long-range adaptation of such LMs with Longformer, allowing the LMs to capture longer clinical note context. We conduct experiments on three n2c2 challenge datasets and a longitudinal clinical dataset from the Hong Kong Hospital Authority electronic health record (EHR) system to show the effectiveness and generalizability of this concept, achieving a 10% F1-score improvement. Based on our experiments, we conclude that capturing a longer clinical note interval is beneficial to model performance, but different target variables have different cut-off intervals at which performance is optimal. Our code is available at https://github.com/HLTCHKUST/long-biomedical-model.
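
The linked repository contains the authors' implementation. As a rough illustration only, not the authors' exact pipeline, the sketch below shows how a Longformer classifier might be applied to a clinical note truncated at a configurable cut-off interval, the quantity the abstract says must be tuned per target variable. The checkpoint, label count, and note text are placeholder assumptions.

```python
# A minimal sketch (not the paper's exact code) of scoring a long clinical
# note with Longformer under a configurable token cut-off interval.
import torch
from transformers import LongformerTokenizerFast, LongformerForSequenceClassification

MAX_TOKENS = 4096  # Longformer's default maximum context length

tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")
model = LongformerForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096", num_labels=2  # label count is a placeholder
)

def encode_note(note_text: str, cutoff: int = MAX_TOKENS):
    """Tokenize a clinical note, truncating it to `cutoff` tokens."""
    enc = tokenizer(
        note_text,
        truncation=True,
        max_length=cutoff,
        padding="max_length",
        return_tensors="pt",
    )
    # Put global attention on the [CLS] token, as is standard for
    # Longformer sequence classification.
    global_attention_mask = torch.zeros_like(enc["input_ids"])
    global_attention_mask[:, 0] = 1
    enc["global_attention_mask"] = global_attention_mask
    return enc

# Example: score one synthetic long note at a 2048-token cut-off interval.
inputs = encode_note("Patient presents with ... " * 500, cutoff=2048)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits)
```

Sweeping `cutoff` over a grid (e.g. 512, 1024, 2048, 4096) and comparing validation F1 per target variable would mirror the interval exploration the abstract describes.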
