Paper Title


A Survey on Visual Transformer

Authors

Kai Han, Yunhe Wang, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao, Chunjing Xu, Yixing Xu, Zhaohui Yang, Yiman Zhang, Dacheng Tao

Abstract

Transformer, first applied to the field of natural language processing, is a type of deep neural network mainly based on the self-attention mechanism. Thanks to its strong representation capabilities, researchers are looking at ways to apply transformer to computer vision tasks. In a variety of visual benchmarks, transformer-based models perform similarly to or better than other types of networks such as convolutional and recurrent neural networks. Given its high performance and lesser need for vision-specific inductive bias, transformer is receiving more and more attention from the computer vision community. In this paper, we review these vision transformer models by categorizing them into different tasks and analyzing their advantages and disadvantages. The main categories we explore include the backbone network, high/mid-level vision, low-level vision, and video processing. We also include efficient transformer methods for pushing transformer into real device-based applications. Furthermore, we also take a brief look at the self-attention mechanism in computer vision, as it is the base component in transformer. Toward the end of this paper, we discuss the challenges and provide several further research directions for vision transformers.
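Since the abstract describes self-attention as the base component of the transformer, a minimal sketch may help readers unfamiliar with it. The NumPy function below is an illustrative scaled dot-product self-attention, not code from the surveyed paper; all names, shapes, and the toy input are our own assumptions.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence of token embeddings.

    x:             (seq_len, d_model) input embeddings
    w_q, w_k, w_v: (d_model, d_k) projection matrices
    Returns:       (seq_len, d_k) attended representations.
    """
    q = x @ w_q                      # queries
    k = x @ w_k                      # keys
    v = x @ w_v                      # values
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)  # pairwise token similarity, scaled
    # numerically stable softmax over keys: each row sums to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v               # weighted sum of values

# Toy usage: 4 "patch tokens" with 8-dim embeddings (hypothetical sizes)
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w_q, w_k, w_v = (rng.standard_normal((8, 8)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (4, 8)
```

In vision transformers, the rows of `x` would be flattened image-patch embeddings rather than word embeddings; the attention mechanism itself is unchanged.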
