Paper Title

Curved Representation Space of Vision Transformers

Paper Authors

Juyeop Kim, Junha Park, Songkuk Kim, Jong-Seok Lee

Paper Abstract

Neural networks with self-attention (a.k.a. Transformers) like ViT and Swin have emerged as a better alternative to traditional convolutional neural networks (CNNs). However, our understanding of how the new architecture works is still limited. In this paper, we focus on the phenomenon that Transformers show higher robustness against corruptions than CNNs, while not being overconfident. This is contrary to the intuition that robustness increases with confidence. We resolve this contradiction by empirically investigating how the output of the penultimate layer moves in the representation space as the input data moves linearly within a small area. In particular, we show the following. (1) While CNNs exhibit a fairly linear relationship between the input and output movements, Transformers show a nonlinear relationship for some data. For those data, the output of Transformers moves in a curved trajectory as the input moves linearly. (2) When a data point is located in a curved region, it is hard to move it out of the decision region since the output moves along a curved trajectory instead of a straight line to the decision boundary, resulting in the high robustness of Transformers. (3) If a data point is slightly modified to jump out of the curved region, the movements afterwards become linear and the output goes to the decision boundary directly. In other words, there does exist a decision boundary near the data, which is hard to find only because of the curved representation space. This explains the underconfident predictions of Transformers. Also, we examine mathematical properties of the attention operation that induce nonlinear response to linear perturbation. Finally, we share our additional findings regarding what contributes to the curved representation space of Transformers, and how the curvedness evolves during training.
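To make the experimental idea concrete, the sketch below moves the input of a pretrained ViT linearly along a fixed direction and measures how far the penultimate-layer features deviate from a straight line. This is an illustrative sketch only, not the authors' protocol: the timm model name, the random perturbation direction, the step size `eps`, and the chord-distance curvature proxy are all assumptions made for illustration.

```python
# Illustrative sketch (assumption-based, not the paper's exact protocol):
# perturb the input linearly, x + t*d, and check whether the penultimate-layer
# features trace a straight line or a curved trajectory.
import torch
import timm

model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()

x = torch.randn(1, 3, 224, 224)   # stand-in for a preprocessed ImageNet image
d = torch.randn_like(x)
d = d / d.norm()                  # unit-norm perturbation direction

@torch.no_grad()
def penultimate(inp):
    # Recent timm ViTs return unpooled tokens (B, N, D) from forward_features();
    # the class token is used here as the penultimate-layer representation.
    return model.forward_features(inp)[:, 0]

eps = 2.0                                     # total perturbation length (assumed)
ts = torch.linspace(0.0, eps, steps=11)
feats = torch.cat([penultimate(x + t * d) for t in ts])   # (11, embed_dim)

# Distance of each intermediate feature from the chord connecting the two
# endpoints; this is zero everywhere if the trajectory is a straight line.
start, end = feats[0], feats[-1]
chord = (end - start) / (end - start).norm()
rel = feats - start
proj = (rel @ chord).unsqueeze(1) * chord
print((rel - proj).norm(dim=1))   # larger values indicate a more curved path
```

As a reminder of why linear input motion need not produce linear feature motion under self-attention, the standard single-head attention map (the textbook form, not the paper's derivation) is

\[
\mathrm{Attn}(X) \;=\; \mathrm{softmax}\!\left(\frac{(XW_Q)(XW_K)^{\top}}{\sqrt{d_k}}\right) XW_V .
\]

Replacing \(X\) with \(X + tD\) makes the pre-softmax logits quadratic in \(t\), and the softmax adds a further nonlinearity, so the layer output generally traces a curved path as \(t\) varies; a convolution, by contrast, is affine in its input, which is consistent with the near-linear input-output relationship reported for CNNs.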
