Paper Title

Deep Transformers Thirst for Comprehensive-Frequency Data

Paper Authors

Rui Xia, Chao Xue, Boyu Deng, Fang Wang, Jingchao Wang

Abstract


Current research indicates that inductive bias (IB) can improve Vision Transformer (ViT) performance. However, existing works introduce a pyramid structure at the same time to counteract the incremental FLOPs and parameters caused by introducing IB. This structure breaks the unification of computer vision and natural language processing (NLP) and complicates the model. We study an NLP model called LSRA, which introduces IB with a pyramid-free structure. Analyzing why it outperforms ViT, we find that introducing IB increases the share of high-frequency data in each layer, giving "attention" to more information. As a result, the heads attend to more diverse information and show better performance. To further explore the potential of transformers, we propose EIT, which Efficiently introduces IB to ViT with a novel decreasing convolutional structure under a pyramid-free structure. EIT achieves competitive performance with state-of-the-art (SOTA) methods on ImageNet-1K and achieves SOTA performance among same-scale models with a pyramid-free structure.
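The abstract's central observation is that inductive bias raises the share of high-frequency content in each layer's features. As a minimal illustration of how such a share could be measured (this is a generic spectral-energy probe, not the paper's procedure; the helper name `high_frequency_share` and the `cutoff` threshold are assumptions introduced here), the sketch below computes the fraction of 2D-FFT energy lying outside a low-frequency band for a single-channel feature map, so that a plain ViT layer and an IB-augmented layer could be compared.

```python
# Minimal sketch (illustrative, not from the paper): estimate the share of
# high-frequency spectral energy in a single-channel feature map via a 2D FFT.
import numpy as np

def high_frequency_share(feature_map: np.ndarray, cutoff: float = 0.25) -> float:
    """Fraction of spectral energy outside a centered low-frequency square.

    `cutoff` is the half-width of the low-frequency region relative to the
    Nyquist frequency; its value here is an assumption for illustration.
    `feature_map` is a 2D array, e.g. one channel of a token grid.
    """
    spectrum = np.fft.fftshift(np.fft.fft2(feature_map))
    energy = np.abs(spectrum) ** 2
    h, w = energy.shape
    ch, cw = h // 2, w // 2
    rh, rw = int(cutoff * h / 2), int(cutoff * w / 2)
    low = energy[ch - rh:ch + rh + 1, cw - rw:cw + rw + 1].sum()
    total = energy.sum()
    return float((total - low) / total)

# Usage: compare the same layer's activations from a plain ViT and from a
# model with convolutional IB; a higher value means more high-frequency content.
x = np.random.randn(14, 14)  # e.g. one channel of a 14x14 token grid
print(high_frequency_share(x))
```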
