Paper Title
Dual Vision Transformer
Paper Authors
Paper Abstract
Prior works have proposed several strategies to reduce the computational cost of the self-attention mechanism. Many of these works consider decomposing the self-attention procedure into regional and local feature extraction procedures, each of which incurs a much smaller computational complexity. However, regional information is typically only obtained at the expense of undesirable information loss owing to down-sampling. In this paper, we propose a novel Transformer architecture that aims to mitigate the cost issue, named Dual Vision Transformer (Dual-ViT). The new architecture incorporates a critical semantic pathway that can more efficiently compress token vectors into global semantics with a reduced order of complexity. These compressed global semantics then serve as useful prior information for learning finer pixel-level details through a second constructed pixel pathway. The semantic and pixel pathways are then integrated and jointly trained, spreading the enhanced self-attention information in parallel through both pathways. Dual-ViT is thus able to reduce computational complexity without compromising much accuracy. We empirically demonstrate that Dual-ViT achieves higher accuracy than SOTA Transformer architectures with reduced training complexity. Source code is available at \url{https://github.com/YehLi/ImageNetModel}.
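To make the dual-pathway idea concrete, below is a minimal, self-contained PyTorch sketch of the pattern the abstract describes: compress the pixel tokens into a small set of semantic tokens, refine those semantics with cheap self-attention, then let pixel tokens cross-attend to the semantics as prior information. All names here (DualPathwayBlock, num_semantic_tokens, the pooling-based compression) are hypothetical illustrations under our own assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
# A minimal sketch of a dual-pathway attention block, assuming:
#  - token compression via adaptive average pooling (the paper may use a
#    different, learned compression),
#  - standard multi-head attention from torch.nn.
import torch
import torch.nn as nn

class DualPathwayBlock(nn.Module):
    def __init__(self, dim=64, num_heads=4, num_semantic_tokens=16):
        super().__init__()
        # Semantic pathway: pool the N pixel tokens down to M << N
        # semantic tokens, then refine them with M x M self-attention.
        self.pool = nn.AdaptiveAvgPool1d(num_semantic_tokens)
        self.sem_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Pixel pathway: pixel tokens cross-attend to the compressed
        # semantics, costing O(N*M) rather than O(N^2).
        self.pix_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                       # x: (B, N, dim) pixel tokens
        # Compress along the sequence axis: (B, N, dim) -> (B, M, dim).
        sem = self.pool(x.transpose(1, 2)).transpose(1, 2)
        # Refine global semantics with cheap self-attention over M tokens.
        s = self.norm1(sem)
        sem = sem + self.sem_attn(s, s, s)[0]
        # Pixel tokens query the global semantics as prior information.
        x = x + self.pix_attn(self.norm2(x), sem, sem)[0]
        return x, sem

block = DualPathwayBlock()
tokens = torch.randn(2, 196, 64)                # e.g. a 14x14 token grid
out, semantics = block(tokens)
print(out.shape, semantics.shape)               # (2, 196, 64) (2, 16, 64)
```

The point of the sketch is the complexity argument: full self-attention over N pixel tokens costs O(N^2), whereas the pathway split pays O(M^2) for refining the semantics plus O(NM) for the pixel cross-attention, which is much cheaper when M is much smaller than N.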