Paper Title

Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation

Paper Authors

Yixuan Wei, Han Hu, Zhenda Xie, Zheng Zhang, Yue Cao, Jianmin Bao, Dong Chen, Baining Guo

Paper Abstract

Masked image modeling (MIM) learns representations with remarkably good fine-tuning performances, overshadowing previous prevalent pre-training approaches such as image classification, instance contrastive learning, and image-text alignment. In this paper, we show that the inferior fine-tuning performance of these pre-training approaches can be significantly improved by a simple post-processing in the form of feature distillation (FD). The feature distillation converts the old representations to new representations that have a few desirable properties just like those representations produced by MIM. These properties, which we aggregately refer to as optimization friendliness, are identified and analyzed by a set of attention- and optimization-related diagnosis tools. With these properties, the new representations show strong fine-tuning performance. Specifically, the contrastive self-supervised learning methods are made as competitive in fine-tuning as the state-of-the-art masked image modeling (MIM) algorithms. The CLIP models' fine-tuning performance is also significantly improved, with a CLIP ViT-L model reaching 89.0% top-1 accuracy on ImageNet-1K classification. On the 3-billion-parameter SwinV2-G model, the fine-tuning accuracy is improved by +1.5 mIoU / +1.1 mAP to 61.4 mIoU / 64.2 mAP on ADE20K semantic segmentation and COCO object detection, respectively, creating new records on both benchmarks. More importantly, our work provides a way for the future research to focus more effort on the generality and scalability of the learnt representations without being pre-occupied with optimization friendliness since it can be enhanced rather easily. The code will be available at https://github.com/SwinTransformer/Feature-Distillation.
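
The abstract describes feature distillation (FD) only at a high level: a frozen, already pre-trained teacher provides target features, and a freshly initialized student of the same architecture is trained to reproduce them, after which the student is fine-tuned downstream. The sketch below is a minimal illustration of that idea in PyTorch, not the authors' implementation; the toy encoder, the linear projection head, the per-token feature whitening, and the smooth-L1 matching loss are simplifying assumptions chosen for readability (see the official repository linked above for the actual recipe).

```python
# Minimal feature-distillation sketch: a frozen teacher (the pre-trained model,
# e.g. a contrastive or CLIP encoder) supplies target features; a student is
# trained to match them. All module sizes here are toy values, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyEncoder(nn.Module):
    """Stand-in for a ViT/Swin backbone; returns token features of shape (B, N, C)."""

    def __init__(self, dim=64, tokens=49):
        super().__init__()
        self.dim = dim
        self.tokens = tokens
        self.proj = nn.Linear(3 * 32 * 32, tokens * dim)

    def forward(self, x):
        b = x.size(0)
        return self.proj(x.flatten(1)).view(b, self.tokens, self.dim)


teacher = ToyEncoder().eval()            # pre-trained model, kept frozen
for p in teacher.parameters():
    p.requires_grad_(False)

student = ToyEncoder()                   # new model trained from scratch
head = nn.Linear(64, 64)                 # lightweight projection on the student side

opt = torch.optim.AdamW(list(student.parameters()) + list(head.parameters()), lr=1e-4)


def distill_step(images):
    with torch.no_grad():
        # Whiten teacher features per token (layer norm without affine parameters).
        targets = F.layer_norm(teacher(images), (teacher.dim,))
    preds = head(student(images))
    loss = F.smooth_l1_loss(preds, targets)   # match student features to teacher targets
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


# Usage: one step on a random batch (stand-in for real pre-training images).
print(distill_step(torch.randn(8, 3, 32, 32)))
```

After this distillation stage, the student (not the teacher) is the model that gets fine-tuned on the target task, which is where the paper reports the improved results.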
