Paper Title
How Do Vision Transformers Work?
Paper Authors
Paper Abstract
The success of multi-head self-attentions (MSAs) for computer vision is now indisputable. However, little is known about how MSAs work. We present fundamental explanations to help better understand the nature of MSAs. In particular, we demonstrate the following properties of MSAs and Vision Transformers (ViTs): (1) MSAs improve not only accuracy but also generalization by flattening the loss landscapes. Such improvement is primarily attributable to their data specificity, not long-range dependency. On the other hand, ViTs suffer from non-convex losses. Large datasets and loss landscape smoothing methods alleviate this problem; (2) MSAs and Convs exhibit opposite behaviors. For example, MSAs are low-pass filters, but Convs are high-pass filters. Therefore, MSAs and Convs are complementary; (3) Multi-stage neural networks behave like a series connection of small individual models. In addition, MSAs at the end of a stage play a key role in prediction. Based on these insights, we propose AlterNet, a model in which Conv blocks at the end of a stage are replaced with MSA blocks. AlterNet outperforms CNNs not only in large data regimes but also in small data regimes. The code is available at https://github.com/xxxnell/how-do-vits-work.
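To make point (3) of the abstract concrete, here is a minimal sketch of the AlterNet idea: a CNN stage in which the Conv block at the end of the stage is replaced with a multi-head self-attention (MSA) block. This is an illustrative sketch in PyTorch, not the authors' implementation (see the linked repository for that); the class and function names `ConvBlock`, `MSABlock`, and `alternating_stage` are hypothetical, and the block designs are simplified stand-ins.

```python
# Sketch of an AlterNet-style stage: Conv blocks, with an MSA block at the end.
# Hypothetical names and simplified blocks; not the reference implementation.
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """A plain pre-activation 3x3 conv block with a residual connection."""
    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)


class MSABlock(nn.Module):
    """Multi-head self-attention over spatial positions of a (B, C, H, W) feature map."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)      # (B, H*W, C): one token per position
        y = self.norm(tokens)
        y, _ = self.attn(y, y, y)                  # global, data-specific spatial mixing
        tokens = tokens + y                        # residual connection
        return tokens.transpose(1, 2).reshape(b, c, h, w)


def alternating_stage(dim, depth):
    """A stage of `depth` blocks: Conv blocks followed by one MSA block at the end."""
    blocks = [ConvBlock(dim) for _ in range(depth - 1)]
    blocks.append(MSABlock(dim))                   # MSA placed at the end of the stage
    return nn.Sequential(*blocks)


if __name__ == "__main__":
    stage = alternating_stage(dim=64, depth=3)
    x = torch.randn(2, 64, 16, 16)
    print(stage(x).shape)                          # torch.Size([2, 64, 16, 16])
```

The placement mirrors the abstract's observation that MSAs at the end of a stage play a key role in prediction, while the earlier Conv blocks retain the high-frequency, local processing that complements the low-pass behavior of self-attention.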