Title

TIER: Text-Image Entropy Regularization for CLIP-style models

Authors

Anil Palepu and Andrew L. Beam

Abstract

In this paper, we introduce a novel regularization scheme for contrastive language-image pre-trained (CLIP) medical vision models. Our approach is based on the observation that, on many medical imaging tasks, text tokens should only describe a small number of image regions and, likewise, each image region should correspond to only a few text tokens. In CLIP-style models, this implies that text-token embeddings should have high similarity to only a small number of image-patch embeddings for a given image-text pair. We formalize this observation using a novel regularization scheme that penalizes the entropy of the text-token to image-patch similarity scores. We qualitatively and quantitatively demonstrate that the proposed regularization scheme shrinks most of the pairwise text-token and image-patch similarity scores towards zero, thus achieving the desired effect. We demonstrate the promise of our approach in an important medical context, chest X-rays, where this underlying sparsity hypothesis naturally arises. Using our proposed approach, we achieve state-of-the-art (SOTA) average zero-shot performance on the CheXpert and PadChest chest X-ray datasets, outperforming an unregularized version of the model and several recently published self-supervised models.
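The abstract describes a penalty on the entropy of the text-token to image-patch similarity scores. The following is a minimal sketch of what such a penalty could look like, not the authors' exact implementation: the function name `entropy_regularizer`, the embedding normalization, the softmax temperature, and averaging the per-token entropies are all assumptions made for illustration.

```python
# Minimal sketch of an entropy penalty on text-token -> image-patch
# similarities, under assumed shapes and hyperparameters (not the
# published TIER implementation).
import torch
import torch.nn.functional as F

def entropy_regularizer(text_tokens: torch.Tensor,
                        image_patches: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """text_tokens: (T, D) token embeddings for one caption;
    image_patches: (P, D) patch embeddings for the paired image."""
    text_tokens = F.normalize(text_tokens, dim=-1)
    image_patches = F.normalize(image_patches, dim=-1)

    # Pairwise cosine similarities between every token and every patch: (T, P).
    sims = text_tokens @ image_patches.t()

    # Turn each token's similarity row into a distribution over patches.
    probs = F.softmax(sims / temperature, dim=-1)

    # Shannon entropy per token; low entropy means a token concentrates
    # its similarity on only a few image patches, the sparsity the paper argues for.
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=-1)
    return entropy.mean()

# Hypothetical usage: add the penalty to the usual CLIP contrastive loss,
# weighted by a regularization coefficient lambda_reg (an assumed name).
# total_loss = clip_loss + lambda_reg * entropy_regularizer(tok_emb, patch_emb)
```

A symmetric term over the patch-to-token direction could be added in the same way, since the abstract states that each image region should likewise correspond to only a few text tokens.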
