Paper Title

What to Hide from Your Students: Attention-Guided Masked Image Modeling

Authors

Ioannis Kakogeorgiou, Spyros Gidaris, Bill Psomas, Yannis Avrithis, Andrei Bursuc, Konstantinos Karantzalos, Nikos Komodakis

Abstract

Transformers and masked language modeling are quickly being adopted and explored in computer vision as vision transformers and masked image modeling (MIM). In this work, we argue that image token masking differs from token masking in text, due to the amount and correlation of tokens in an image. In particular, to generate a challenging pretext task for MIM, we advocate a shift from random masking to informed masking. We develop and exhibit this idea in the context of distillation-based MIM, where a teacher transformer encoder generates an attention map, which we use to guide masking for the student. We thus introduce a novel masking strategy, called attention-guided masking (AttMask), and we demonstrate its effectiveness over random masking for dense distillation-based MIM as well as plain distillation-based self-supervised learning on classification tokens. We confirm that AttMask accelerates the learning process and improves the performance on a variety of downstream tasks. We provide the implementation code at https://github.com/gkakogeorgiou/attmask.
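As a rough illustration of the attention-guided masking idea summarized above, here is a minimal sketch in PyTorch. It assumes the teacher's last-layer [CLS]-to-patch attention has already been extracted and averaged over heads; the function name, masking ratio, and tensor shapes are illustrative assumptions rather than the authors' exact implementation, which is available at the linked repository.

```python
import torch

def attention_guided_mask(cls_attention, mask_ratio=0.5):
    """Sketch of attention-guided masking (AttMask).

    cls_attention: (batch, num_patches) teacher attention from the [CLS]
        token to each image patch, averaged over heads (assumed given).
    Returns a boolean mask of shape (batch, num_patches); True = masked.
    """
    batch, num_patches = cls_attention.shape
    num_masked = int(mask_ratio * num_patches)

    # Hide the patches the teacher attends to most, so the student must
    # reconstruct/predict the most salient regions (informed masking).
    top_idx = cls_attention.topk(num_masked, dim=1).indices

    mask = torch.zeros(batch, num_patches, dtype=torch.bool)
    rows = torch.arange(batch).unsqueeze(1)  # (batch, 1) row indices
    mask[rows, top_idx] = True
    return mask

# Hypothetical usage: a ViT teacher with 14x14 = 196 patches per image.
attn = torch.rand(8, 196).softmax(dim=1)
mask = attention_guided_mask(attn, mask_ratio=0.45)
```

The contrast with random masking is that the masked positions are chosen from the teacher's attention rather than sampled uniformly, which is what makes the pretext task more challenging for the student.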
