Paper Title


SdAE: Self-distillated Masked Autoencoder

Authors

Yabo Chen, Yuchen Liu, Dongsheng Jiang, Xiaopeng Zhang, Wenrui Dai, Hongkai Xiong, Qi Tian

Abstract

With the development of generative-based self-supervised learning (SSL) approaches like BeiT and MAE, learning good representations by masking random patches of the input image and reconstructing the missing information has attracted growing attention. However, BeiT and PeCo require a "pre-pretraining" stage to produce a discrete codebook for representing masked patches. MAE does not require such a codebook, but using raw pixels as reconstruction targets may introduce an optimization gap between pre-training and downstream tasks, i.e., good reconstruction quality may not always lead to high descriptive capability of the model. Considering the above issues, in this paper we propose a simple Self-distillated masked AutoEncoder network, namely SdAE. SdAE consists of a student branch that uses an encoder-decoder structure to reconstruct the missing information, and a teacher branch that produces latent representations of the masked tokens. We also analyze, from the perspective of the information bottleneck, how to build good views for the teacher branch to produce latent representations. We then propose a multi-fold masking strategy that provides multiple masked views with balanced information to boost performance, and which also reduces computational complexity. Our approach generalizes well: with only 300 epochs of pre-training, a vanilla ViT-Base model achieves 84.1% fine-tuning accuracy on ImageNet-1k classification, 48.6 mIoU on ADE20K segmentation, and 48.9 mAP on COCO detection, surpassing other methods by a considerable margin. Code is available at https://github.com/AbrahamYabo/SdAE.
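To make the abstract's student-teacher setup concrete, below is a minimal, self-contained sketch (not the official SdAE code) of self-distillation with a multi-fold masking strategy: the masked patch indices are split into disjoint folds, the student encodes the visible patches and decodes predictions for each fold, and an EMA teacher produces the latent targets. The tiny encoder, positional handling, loss choice, and all hyper-parameters are illustrative assumptions.

```python
# Minimal sketch of multi-fold masking + student/teacher feature distillation.
# NOT the official SdAE implementation; shapes and losses are assumptions.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


def multi_fold_mask(num_patches: int, num_visible: int, num_folds: int):
    """Shuffle patch indices, keep `num_visible` for the student, and split
    the remaining masked indices into `num_folds` disjoint groups."""
    perm = torch.randperm(num_patches)
    return perm[:num_visible], list(torch.chunk(perm[num_visible:], num_folds))


class TinyEncoder(nn.Module):
    """Stand-in for a ViT encoder: linear patch embedding + Transformer blocks."""
    def __init__(self, patch_dim=768, dim=256, depth=4, heads=4):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, patches):                  # patches: (B, N, patch_dim)
        return self.blocks(self.embed(patches))  # (B, N, dim)


class SdAESketch(nn.Module):
    def __init__(self, patch_dim=768, dim=256, num_patches=196):
        super().__init__()
        self.student = TinyEncoder(patch_dim, dim)
        self.decoder = nn.TransformerEncoderLayer(dim, 4, dim * 4, batch_first=True)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))
        # Teacher is an EMA copy of the student encoder; it receives no gradients.
        self.teacher = copy.deepcopy(self.student)
        for p in self.teacher.parameters():
            p.requires_grad = False

    @torch.no_grad()
    def ema_update(self, momentum=0.996):
        for ps, pt in zip(self.student.parameters(), self.teacher.parameters()):
            pt.mul_(momentum).add_(ps.detach(), alpha=1 - momentum)

    def forward(self, patches, num_visible=49, num_folds=3):
        B, N, _ = patches.shape
        visible, folds = multi_fold_mask(N, num_visible, num_folds)

        # Student branch: encode visible patches, add decoder positions.
        enc = self.student(patches[:, visible]) + self.pos[:, visible]
        loss = 0.0
        for fold in folds:
            queries = self.mask_token.expand(B, len(fold), -1) + self.pos[:, fold]
            pred = self.decoder(torch.cat([enc, queries], dim=1))[:, -len(fold):]

            # Teacher branch: latent targets for this masked fold (stop-grad).
            with torch.no_grad():
                target = self.teacher(patches[:, fold])

            # Feature-space distillation loss (normalized MSE, an assumption).
            loss = loss + F.mse_loss(F.normalize(pred, dim=-1),
                                     F.normalize(target, dim=-1))
        return loss / num_folds


if __name__ == "__main__":
    model = SdAESketch()
    dummy = torch.randn(2, 196, 768)   # 2 images, 14x14 patches of 16x16x3 pixels
    loss = model(dummy)
    loss.backward()
    model.ema_update()
    print(float(loss))
```

One design point this sketch illustrates: because each masked fold is encoded by the teacher separately, every view carries a comparable amount of information, and the student's decoder only ever attends over the visible tokens plus one fold of mask queries at a time, which keeps the per-view attention cost small.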
