Paper Title
Attention Distillation: self-supervised vision transformer students need more guidance
Paper Authors
Paper Abstract
Self-supervised learning has been widely applied to train high-quality vision transformers. Unleashing their excellent performance on memory- and compute-constrained devices is therefore an important research topic. However, how to distill knowledge from one self-supervised ViT to another has not yet been explored. Moreover, the existing self-supervised knowledge distillation (SSKD) methods, which focus on ConvNet-based architectures, are suboptimal for ViT knowledge distillation. In this paper, we study knowledge distillation of self-supervised vision transformers (ViT-SSKD). We show that directly distilling information from the crucial attention mechanism of the teacher to the student can significantly narrow the performance gap between the two. In experiments on ImageNet-Subset and ImageNet-1K, we show that our method AttnDistill outperforms existing SSKD methods and achieves state-of-the-art k-NN accuracy compared with self-supervised learning (SSL) methods that learn from scratch (with the ViT-S model). We are also the first to apply the tiny ViT-T model to self-supervised learning. Moreover, AttnDistill is independent of the self-supervised learning algorithm, so it can be adapted to ViT-based SSL methods to improve their performance in future research. The code is available at: https://github.com/wangkai930418/attndistill
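The abstract's core idea is to distill the teacher ViT's attention maps directly into the student. Below is a minimal sketch of what such an attention-distillation loss could look like in PyTorch. It is not the authors' implementation (see the linked repository for that); the tensor shapes, the [CLS]-attention slicing, the head-averaging alignment, and the KL-divergence objective are all assumptions made for illustration.

```python
# Minimal sketch of attention-map distillation between a teacher and a student ViT.
# Assumption: both models expose the attention probabilities of a transformer block
# as a tensor of shape (batch, heads, tokens, tokens), with token 0 being [CLS].
import torch
import torch.nn.functional as F

def attention_distill_loss(student_attn: torch.Tensor,
                           teacher_attn: torch.Tensor,
                           eps: float = 1e-8) -> torch.Tensor:
    """KL divergence between teacher and student [CLS]-to-patch attention."""
    # Keep only the attention from the [CLS] token to the patch tokens.
    s = student_attn[:, :, 0, 1:]   # (B, H_s, N)
    t = teacher_attn[:, :, 0, 1:]   # (B, H_t, N)
    # If head counts differ, average over heads as a simple alignment
    # (a learned projection over heads would be another option).
    s = s.mean(dim=1)               # (B, N)
    t = t.mean(dim=1)               # (B, N)
    # Renormalize over patches, since the [CLS]->[CLS] entry was dropped.
    s = s / (s.sum(dim=-1, keepdim=True) + eps)
    t = t / (t.sum(dim=-1, keepdim=True) + eps)
    return F.kl_div(s.clamp_min(eps).log(), t, reduction="batchmean")

# Usage during student training (teacher attention is detached so no gradients
# flow into the frozen teacher); lambda_attn is a hypothetical weighting factor:
#   loss = ssl_loss + lambda_attn * attention_distill_loss(s_attn, t_attn.detach())
```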