Paper Title
Eye-gaze-guided Vision Transformer for Rectifying Shortcut Learning
Paper Authors
Paper Abstract
Learning harmful shortcuts, such as spurious correlations and biases, prevents deep neural networks from learning meaningful and useful representations, which jeopardizes the generalizability and interpretability of the learned representations. The problem is even more serious in medical imaging, where clinical data (e.g., MR images with pathology) are scarce while the reliability, generalizability, and transparency of the learned model are essential. To address this problem, we propose to infuse human experts' intelligence and domain knowledge into the training of deep neural networks. The core idea is to use visual attention information from expert radiologists to proactively guide the deep model to focus on regions with potential pathology and to avoid being trapped in learning harmful shortcuts. To this end, we propose a novel eye-gaze-guided vision transformer (EG-ViT) for diagnosis with limited medical image data. We mask the input image patches that fall outside the radiologists' regions of interest and add an additional residual connection in the last encoder layer of EG-ViT to maintain the correlations among all patches. Experiments on two public datasets, INbreast and SIIM-ACR, demonstrate that our EG-ViT model can effectively learn and transfer experts' domain knowledge and achieve much better performance than the baselines. Meanwhile, it successfully rectifies harmful shortcut learning and significantly improves the model's interpretability. In general, EG-ViT takes advantage of both human experts' prior knowledge and the power of deep neural networks. This work opens new avenues for advancing current artificial intelligence paradigms by infusing human intelligence.
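The patch-masking step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `gaze_patch_mask` and `kept_patch_indices`, the mean-intensity rule, and the `threshold` parameter are all assumptions made for illustration, since the abstract does not say how gaze data are converted into a patch mask. The sketch also leaves out the additional residual connection in the last encoder layer.

```python
def gaze_patch_mask(heatmap, patch_size, threshold):
    """Return a grid of booleans, True where a patch's mean gaze intensity
    reaches the threshold (an assumed selection rule, not from the paper)."""
    height, width = len(heatmap), len(heatmap[0])
    if height % patch_size or width % patch_size:
        raise ValueError("heatmap dimensions must be divisible by patch_size")
    mask = []
    for top in range(0, height, patch_size):
        row = []
        for left in range(0, width, patch_size):
            # Sum the gaze intensity over every pixel in this patch.
            total = sum(
                heatmap[y][x]
                for y in range(top, top + patch_size)
                for x in range(left, left + patch_size)
            )
            row.append(total / (patch_size * patch_size) >= threshold)
        mask.append(row)
    return mask


def kept_patch_indices(mask):
    """Flatten the grid in row-major order and list the kept patch indices."""
    flat = [keep for row in mask for keep in row]
    return [i for i, keep in enumerate(flat) if keep]


# Toy 4x4 gaze heatmap: attention is concentrated in the top-left quadrant.
heatmap = [
    [0.9, 0.8, 0.0, 0.0],
    [0.7, 0.6, 0.0, 0.1],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.1, 0.0, 0.0],
]
mask = gaze_patch_mask(heatmap, patch_size=2, threshold=0.5)
kept = kept_patch_indices(mask)
```

In this toy case only the top-left patch passes the threshold, so a transformer fed the kept indices would receive that single patch token. A real pipeline would need a gaze heatmap aligned to the image and a patch size matching the ViT's tokenizer.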