Paper Title
PathologyBERT -- Pre-trained Vs. A New Transformer Language Model for Pathology Domain
Paper Authors
Paper Abstract
Pathology text mining is a challenging task given the reporting variability and constant new findings in cancer sub-type definitions. However, successful text mining of a large pathology database can play a critical role in advancing 'big data' cancer research, including similarity-based treatment selection, case identification, prognostication, surveillance, clinical trial screening, risk stratification, and many other applications. While there is growing interest in developing language models for more specific clinical domains, no pathology-specific language space exists to support rapid data-mining development in the pathology space. In the literature, a few approaches have fine-tuned general transformer models on specialized corpora while maintaining the original tokenizer, but in fields requiring specialized terminology, these models often fail to perform adequately. We propose PathologyBERT, a pre-trained masked language model trained on 347,173 histopathology specimen reports and publicly released in the Huggingface repository. Our comprehensive experiments demonstrate that pre-training a transformer model on a pathology corpus yields performance improvements on Natural Language Understanding (NLU) and Breast Cancer Diagnosis Classification compared to nonspecific language models.
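Since the abstract states the model is publicly released in the Huggingface repository, a minimal usage sketch with the `transformers` library is given below. The hub ID `tsantos/PathologyBERT` and the example sentence are assumptions for illustration only; verify the actual model ID on the Hugging Face hub before use.

```python
# Minimal sketch: querying a masked language model released on the Hugging Face hub.
# The hub ID below is an assumption; substitute the actual PathologyBERT release.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="tsantos/PathologyBERT")

# Example cloze query over pathology-style text; [MASK] is BERT's mask token.
for pred in fill_mask("invasive ductal [MASK] of the breast"):
    print(f"{pred['token_str']:>15}  score={pred['score']:.3f}")
```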