Paper Title
AlignTransformer: Hierarchical Alignment of Visual Regions and Disease Tags for Medical Report Generation
Paper Authors
Paper Abstract
Recently, medical report generation, which aims to automatically generate a long and coherent descriptive paragraph for a given medical image, has received growing research interest. Unlike general image captioning tasks, medical report generation is more challenging for data-driven neural models. This is mainly due to 1) the severe data bias: normal visual regions dominate the dataset over abnormal visual regions, and 2) the very long sequences. To alleviate the above two problems, we propose the AlignTransformer framework, which includes the Align Hierarchical Attention (AHA) and Multi-Grained Transformer (MGT) modules: 1) the AHA module first predicts disease tags from the input image and then learns multi-grained visual features by hierarchically aligning the visual regions and disease tags. The acquired disease-grounded visual features can better represent the abnormal regions of the input image, which could alleviate the data bias problem; 2) the MGT module effectively combines the multi-grained features with the Transformer framework to generate long medical reports. Experiments on the public IU-Xray and MIMIC-CXR datasets show that AlignTransformer achieves results competitive with state-of-the-art methods on both datasets. Moreover, a human evaluation conducted by professional radiologists further demonstrates the effectiveness of our approach.
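For illustration, the following is a minimal PyTorch sketch of the two modules described in the abstract. The class names, hidden size, layer counts, the use of standard cross-attention for the alignment step, and the concatenation of all granularities into a single decoder memory are assumptions made for this sketch, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class AlignHierarchicalAttention(nn.Module):
    """Sketch of the AHA idea: ground visual region features on predicted
    disease tags via stacked cross-attention. Dimensions, layer count, and
    the use of nn.MultiheadAttention are illustrative assumptions."""

    def __init__(self, dim=512, num_heads=8, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_layers)
        ])

    def forward(self, regions, tag_embeds):
        # regions: (B, N_regions, dim) from a visual feature extractor;
        # tag_embeds: (B, N_tags, dim) embeddings of the predicted disease
        # tags (the tag classifier itself is omitted here).
        multi_grained = []
        query = tag_embeds
        for attn in self.layers:
            # Each layer re-attends to the visual regions, refining the
            # disease-grounded features; one granularity per layer.
            query, _ = attn(query=query, key=regions, value=regions)
            multi_grained.append(query)
        return multi_grained

class MultiGrainedTransformer(nn.Module):
    """Sketch of the MGT idea: a Transformer decoder that attends to the
    multi-grained disease-grounded features while generating the report."""

    def __init__(self, vocab_size, dim=512, num_heads=8, num_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerDecoderLayer(dim, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, tokens, multi_grained):
        # Concatenating all granularities into one memory is an assumption
        # made for brevity; the paper may fuse them differently.
        memory = torch.cat(multi_grained, dim=1)
        tgt = self.embed(tokens)  # (B, T, dim) report tokens so far
        # Causal mask so each position only attends to earlier tokens.
        T = tokens.size(1)
        mask = torch.triu(
            torch.full((T, T), float("-inf"), device=tokens.device),
            diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=mask)
        return self.out(hidden)  # next-token logits over the vocabulary
```

At generation time, the two sketches would be chained: the AHA module produces the list of disease-grounded feature sets, and the MGT decoder consumes them autoregressively to emit the report token by token.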