Paper Title
ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond
Paper Authors
Paper Abstract
Vision transformers have shown great potential in various computer vision tasks owing to their strong capability to model long-range dependency using the self-attention mechanism. Nevertheless, they treat an image as a 1D sequence of visual tokens, lacking an intrinsic inductive bias (IB) in modeling local visual structures and dealing with scale variance, which is instead learned implicitly from large-scale training data with longer training schedules. In this paper, we propose a Vision Transformer Advanced by Exploring the intrinsic IB from convolutions, i.e., ViTAE. Technically, ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context, using multiple convolutions with different dilation rates. In this way, it acquires an intrinsic scale-invariance IB and can learn robust feature representations for objects at various scales. Moreover, in each transformer layer, ViTAE has a convolution block parallel to the multi-head self-attention module, whose features are fused and fed into the feed-forward network. Consequently, it has an intrinsic locality IB and is able to learn local features and global dependencies collaboratively. The proposed two kinds of cells are stacked in both isotropic and multi-stage manners to formulate two families of ViTAE models, i.e., the vanilla ViTAE and ViTAEv2. Experiments on the ImageNet dataset as well as downstream tasks on the MS COCO, ADE20K, and AP10K datasets validate the superiority of our models over the baseline transformer models and concurrent works. Furthermore, we scale up our ViTAE model to 644M parameters and obtain state-of-the-art classification performance, i.e., 88.5% Top-1 classification accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet real validation set, without using extra private data.
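To make the two cell types described in the abstract concrete, below is a minimal PyTorch sketch based only on the wording above: a reduction cell that embeds the image with parallel dilated convolutions to produce multi-scale tokens, and a normal cell with a convolution branch parallel to multi-head self-attention whose outputs are fused before the feed-forward network. The module names (ReductionCell, NormalCell), the dilation rates, the summation-based fusion, and all dimensions are illustrative assumptions, not the authors' official implementation.

```python
# Minimal sketch of the two ViTAE-style cells described in the abstract.
# All names, dilation rates, and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class ReductionCell(nn.Module):
    """Downsample the image with parallel dilated convolutions and fuse
    them into tokens carrying multi-scale context (scale-invariance IB)."""

    def __init__(self, in_ch=3, embed_dim=64, dilations=(1, 2, 3, 4), stride=4):
        super().__init__()
        # With kernel 3 and padding == dilation, every branch keeps the same
        # output resolution, so their features can be concatenated directly.
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, embed_dim, kernel_size=3, stride=stride,
                      padding=d, dilation=d)
            for d in dilations
        ])
        self.fuse = nn.Conv2d(embed_dim * len(dilations), embed_dim, kernel_size=1)

    def forward(self, x):                       # x: (B, C, H, W)
        feats = [b(x) for b in self.branches]   # one feature map per dilation
        x = self.fuse(torch.cat(feats, dim=1))  # (B, embed_dim, H/stride, W/stride)
        return x.flatten(2).transpose(1, 2)     # (B, N, embed_dim) token sequence


class NormalCell(nn.Module):
    """Transformer layer with a convolution block parallel to MHSA
    (locality IB); the two branches are summed and fed to the FFN."""

    def __init__(self, dim=64, num_heads=4, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.conv = nn.Sequential(               # local branch on the 2D token grid
            nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=1),
        )
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x, hw):                   # x: (B, N, dim), hw: token grid size
        h, w = hw
        y = self.norm1(x)
        attn_out, _ = self.attn(y, y, y)        # global branch: long-range dependency
        conv_in = y.transpose(1, 2).reshape(x.size(0), -1, h, w)
        conv_out = self.conv(conv_in).flatten(2).transpose(1, 2)
        x = x + attn_out + conv_out             # fuse parallel branches + residual
        return x + self.ffn(self.norm2(x))      # feed-forward network + residual


if __name__ == "__main__":
    img = torch.randn(1, 3, 224, 224)
    tokens = ReductionCell()(img)               # (1, 56*56, 64) multi-scale tokens
    out = NormalCell()(tokens, (56, 56))
    print(out.shape)                            # torch.Size([1, 3136, 64])
```

Stacking such cells isotropically (constant token resolution) or in a multi-stage pyramid (repeated reduction cells) would correspond, respectively, to the vanilla ViTAE and ViTAEv2 families mentioned in the abstract.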