Paper Title
Self-Promoted Supervision for Few-Shot Transformer
Paper Authors
Paper Abstract
The few-shot learning ability of vision transformers (ViTs) is rarely investigated, though heavily desired. In this work, we empirically find that within the same few-shot learning framework, e.g., Meta-Baseline, replacing the widely used CNN feature extractor with a ViT model often severely impairs few-shot classification performance. Moreover, our empirical study shows that in the absence of inductive bias, ViTs often learn low-quality token dependencies under the few-shot learning regime, where only a few labeled training samples are available, and this largely contributes to the above performance degradation. To alleviate this issue, we propose, for the first time, a simple yet effective few-shot training framework for ViTs, namely Self-promoted sUpervisioN (SUN). Specifically, besides the conventional global supervision for global semantic learning, SUN further pretrains the ViT on the few-shot learning dataset and then uses it to generate individual location-specific supervision for guiding each patch token. This location-specific supervision tells the ViT which patch tokens are similar or dissimilar, and thus accelerates token dependency learning. Moreover, it models the local semantics in each patch token to improve the object grounding and recognition capability, which helps learn generalizable patterns. To improve the quality of the location-specific supervision, we further propose two techniques: 1) background patch filtration, which filters out background patches and assigns them to an extra background class; and 2) spatial-consistent augmentation, which introduces sufficient diversity into data augmentation while preserving the accuracy of the generated local supervision. Experimental results show that SUN with ViTs significantly surpasses other few-shot learning frameworks using ViTs, and is the first to achieve higher performance than state-of-the-art CNN-based methods.
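The two core ideas in the abstract, per-patch pseudo-labels from a pretrained teacher and background patch filtration, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the class count, confidence threshold, and function name are hypothetical, and the teacher's per-patch logits are stand-ins for the output of the pretrained ViT mentioned in the abstract.

```python
import numpy as np

NUM_CLASSES = 5         # hypothetical number of base classes
BG_CLASS = NUM_CLASSES  # extra background class index (abstract's technique 1)
BG_THRESHOLD = 0.5      # hypothetical confidence threshold, not from the paper

def location_specific_targets(teacher_patch_logits, bg_threshold=BG_THRESHOLD):
    """Turn a teacher's per-patch logits into patch-level pseudo-labels.

    teacher_patch_logits: (num_patches, NUM_CLASSES) array from a
    teacher pretrained on the few-shot dataset. Patches whose maximum
    class probability falls below `bg_threshold` are treated as
    background and assigned the extra background class.
    """
    # numerically stable softmax over classes for each patch token
    z = teacher_patch_logits - teacher_patch_logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    labels = probs.argmax(axis=1)
    # background patch filtration: low-confidence patches -> background class
    labels[probs.max(axis=1) < bg_threshold] = BG_CLASS
    return labels

# Four patch tokens: three confidently foreground, one near-uniform (background).
logits = np.array([[5., 0., 0., 0., 0.],
                   [0., 5., 0., 0., 0.],
                   [0., 0., 0., 0., 0.],   # uniform -> filtered to background
                   [0., 0., 0., 0., 5.]])
print(location_specific_targets(logits).tolist())  # [0, 1, 5, 4]
```

In the full framework these per-patch labels would supervise the student ViT's patch tokens alongside the usual global label, which is what the abstract means by "location-specific supervision".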