Paper Title
TCM-SD: A Benchmark for Probing Syndrome Differentiation via Natural Language Processing
Authors
Abstract
Traditional Chinese Medicine (TCM) is a natural, safe, and effective therapy that has spread and been applied worldwide. The unique TCM diagnosis and treatment system requires a comprehensive analysis of a patient's symptoms, which are hidden in clinical records written in free text. Prior studies have shown that this system can be informatized and made intelligent with the aid of artificial intelligence (AI) technology, such as natural language processing (NLP). However, existing datasets have neither the quality nor the quantity needed to support the further development of data-driven AI technology in TCM. Therefore, in this paper, we focus on the core task of the TCM diagnosis and treatment system, syndrome differentiation (SD), and introduce the first public large-scale dataset for SD, called TCM-SD. Our dataset contains 54,152 real-world clinical records covering 148 syndromes. Furthermore, we collect a large-scale unlabelled textual corpus in the field of TCM and propose a domain-specific pre-trained language model, called ZY-BERT. We conduct experiments using deep neural networks to establish a strong performance baseline, reveal various challenges in SD, and demonstrate the potential of domain-specific pre-trained language models. Our study and analysis reveal opportunities for incorporating computer science and linguistics knowledge to explore the empirical validity of TCM theories.