Paper Title
ConvMAE: Masked Convolution Meets Masked Autoencoders
Paper Authors
Paper Abstract
Vision Transformers (ViT) have become widely adopted architectures for various vision tasks. Masked auto-encoding for feature pretraining and multi-scale hybrid convolution-transformer architectures can further unleash the potential of ViT, leading to state-of-the-art performance on image classification, detection, and semantic segmentation. In this paper, our ConvMAE framework demonstrates that a multi-scale hybrid convolution-transformer can learn more discriminative representations via the masked auto-encoding scheme. However, directly applying the original masking strategy leads to heavy computational cost and a pretraining-finetuning discrepancy. To tackle this issue, we adopt masked convolution to prevent information leakage in the convolution blocks, and propose a simple block-wise masking strategy to ensure computational efficiency. We also propose to supervise the multi-scale features of the encoder more directly to strengthen multi-scale representations. Based on our pretrained ConvMAE models, ConvMAE-Base improves ImageNet-1K finetuning accuracy by 1.4% compared with MAE-Base. On object detection, ConvMAE-Base finetuned for only 25 epochs surpasses MAE-Base finetuned for 100 epochs by 2.9% box AP and 2.2% mask AP, respectively. Code and pretrained models are available at https://github.com/Alpha-VL/ConvMAE.
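The abstract names two mechanisms: masked convolution, which blocks information flow between visible and masked regions inside the convolution stages, and a block-wise masking strategy, where one mask sampled at the coarsest token grid is shared by all finer stages. The PyTorch sketch below is a minimal illustration of those two ideas under stated assumptions, not the authors' implementation; the names MaskedDepthwiseConv and block_wise_mask, the depthwise 5x5 kernel, and the 75% mask ratio are chosen for the example only.

```python
import torch
import torch.nn as nn

class MaskedDepthwiseConv(nn.Module):
    """Sketch of a masked convolution block: masked positions are zeroed before
    and after the spatial convolution, so visible tokens never receive
    information from (or leak into) masked regions."""
    def __init__(self, dim: int, kernel_size: int = 5):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, kernel_size,
                              padding=kernel_size // 2, groups=dim)

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map; mask: (B, 1, H, W), 1 = visible, 0 = masked
        x = x * mask          # hide masked positions before spatial mixing
        x = self.conv(x)
        return x * mask       # keep masked positions empty afterwards

def block_wise_mask(batch: int, grid: int = 14, mask_ratio: float = 0.75,
                    device: str = "cpu") -> torch.Tensor:
    """Sample one random visibility mask at the coarsest (transformer) token grid."""
    num_keep = int(grid * grid * (1 - mask_ratio))
    noise = torch.rand(batch, grid * grid, device=device)
    keep_idx = noise.argsort(dim=1)[:, :num_keep]
    mask = torch.zeros(batch, grid * grid, device=device)
    mask.scatter_(1, keep_idx, 1.0)
    return mask.view(batch, 1, grid, grid)

# Block-wise sharing: the 14x14 mask (224x224 input, patch size 16) is upsampled
# to the 28x28 and 56x56 grids of the finer convolution stages, so the same
# image blocks stay visible at every scale.
mask_s3 = block_wise_mask(batch=2)                                           # (2, 1, 14, 14)
mask_s2 = mask_s3.repeat_interleave(2, dim=-1).repeat_interleave(2, dim=-2)  # (2, 1, 28, 28)
mask_s1 = mask_s2.repeat_interleave(2, dim=-1).repeat_interleave(2, dim=-2)  # (2, 1, 56, 56)
```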