Paper Title
Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets
Paper Authors
Paper Abstract
There still remains an extreme performance gap between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) when training from scratch on small datasets, which is attributed to the lack of inductive bias. In this paper, we further consider this problem and point out two weaknesses of ViTs regarding inductive biases: spatial relevance and diverse channel representation. First, on the spatial aspect, objects are locally compact and relevant, so fine-grained features need to be extracted from a token and its neighbors; however, the lack of data hinders ViTs from attending to this spatial relevance. Second, on the channel aspect, representations exhibit diversity across different channels, but the scarce data does not enable ViTs to learn representations strong enough for accurate recognition. To this end, we propose the Dynamic Hybrid Vision Transformer (DHVT) as a solution that strengthens both inductive biases. On the spatial aspect, we adopt a hybrid structure in which convolution is integrated into the patch embedding and multi-layer perceptron (MLP) modules, forcing the model to capture each token's features together with those of its neighbors. On the channel aspect, we introduce a dynamic feature aggregation module in the MLP and a brand-new "head token" design in the multi-head self-attention module, which help re-calibrate channel representations and let different channel-group representations interact with each other. The fusion of weak channel representations forms a representation strong enough for classification. With this design, we successfully eliminate the performance gap between CNNs and ViTs, and our DHVT achieves a series of state-of-the-art results with a lightweight model: 85.68% on CIFAR-100 with 22.8M parameters and 82.3% on ImageNet-1K with 24.0M parameters. Code is available at https://github.com/ArieSeirack/DHVT.
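To make the convolution-in-MLP idea concrete, below is a minimal PyTorch sketch of a feed-forward block with a depthwise convolution between its two linear projections, the kind of hybrid structure the abstract describes for the spatial aspect. This is an illustrative sketch, not the paper's released implementation: the names HybridMLP, dim, and hidden_dim are assumptions, and the class token is omitted so the token sequence can be reshaped into a 2D grid.

```python
import torch
import torch.nn as nn

class HybridMLP(nn.Module):
    """Feed-forward block with a depthwise convolution between the two
    linear projections, so each token also mixes with its spatial
    neighbors (the "spatial relevance" inductive bias). Illustrative
    sketch; not the authors' released code."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        # Depthwise 3x3 conv: one filter per channel, applied on the
        # 2D token grid rather than on the flattened sequence.
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3,
                                padding=1, groups=hidden_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (batch, num_patches, dim); requires h * w == num_patches
        x = self.act(self.fc1(x))
        b, n, c = x.shape
        x = x.transpose(1, 2).reshape(b, c, h, w)   # sequence -> 2D grid
        x = self.act(self.dwconv(x))                # mix with neighbors
        x = x.reshape(b, c, n).transpose(1, 2)      # 2D grid -> sequence
        return self.fc2(x)

# Example: 196 tokens from a 14x14 patch grid, embedding dim 384
tokens = torch.randn(2, 196, 384)
block = HybridMLP(dim=384, hidden_dim=1536)
out = block(tokens, h=14, w=14)
print(out.shape)  # torch.Size([2, 196, 384])
```

The reshape-convolve-flatten pattern is what lets a token attend to its spatial neighbors without data-hungry attention; the dynamic feature aggregation and head-token mechanisms from the channel aspect are not sketched here.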