Paper Title
PathologyBERT -- Pre-trained Vs. A New Transformer Language Model for Pathology Domain
Paper Authors
Paper Abstract
Pathology text mining is a challenging task given the reporting variability and constant new findings in cancer sub-type definitions. However, successful text mining of a large pathology database can play a critical role in advancing 'big data' cancer research, including similarity-based treatment selection, case identification, prognostication, surveillance, clinical trial screening, risk stratification, and many other applications. While there is growing interest in developing language models for more specific clinical domains, no pathology-specific language space exists to support rapid data-mining development in the pathology space. In the literature, a few approaches have fine-tuned general transformer models on specialized corpora while maintaining the original tokenizer, but in fields requiring specialized terminology, these models often fail to perform adequately. We propose PathologyBERT, a pre-trained masked language model trained on 347,173 histopathology specimen reports and publicly released in the Huggingface repository. Our comprehensive experiments demonstrate that pre-training a transformer model on a pathology corpus yields performance improvements on Natural Language Understanding (NLU) and Breast Cancer Diagnosis Classification compared to nonspecific language models.
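Since the abstract states the model is publicly released in the Huggingface repository, a minimal usage sketch with the `transformers` library is given below. The hub ID `tsantos/PathologyBERT` and the example sentence are assumptions for illustration only; verify the actual model ID on the Hugging Face hub before use.

```python
# Minimal sketch: querying a masked language model released on the Hugging Face hub.
# The hub ID below is an assumption; substitute the actual PathologyBERT release.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="tsantos/PathologyBERT")

# Example cloze query over pathology-style text; [MASK] is BERT's mask token.
for pred in fill_mask("invasive ductal [MASK] of the breast"):
    print(f"{pred['token_str']:>15}  score={pred['score']:.3f}")
```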