Paper Title

VidConv: A modernized 2D ConvNet for Efficient Video Recognition

Paper Authors

Chuong H. Nguyen, Su Huynh, Vinh Nguyen, Ngoc Nguyen

Paper Abstract

Since being introduced in 2020, Vision Transformers (ViT) have been steadily breaking records on many vision tasks and are often described as "all-you-need" to replace ConvNets. Despite that, ViTs are generally computationally expensive, memory-consuming, and unfriendly for embedded devices. In addition, recent research shows that standard ConvNets, if redesigned and trained appropriately, can compete favorably with ViTs in terms of accuracy and scalability. In this paper, we adopt the modernized structure of ConvNets to design a new backbone for action recognition. In particular, our main target is industrial product deployment, such as FPGA boards on which only standard operations are supported. Therefore, our network simply consists of 2D convolutions, without any 3D convolution, long-range attention plugin, or Transformer block. While being trained with far fewer epochs (5x-10x), our backbone surpasses methods using (2+1)D and 3D convolutions, and achieves results comparable to ViTs on two benchmark datasets.
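The abstract's 2D-only constraint can be illustrated with a minimal sketch. This is not the paper's actual architecture, only a common trick that makes a plain 2D convolution process video: fold the time axis into the batch axis so each frame is convolved independently with shared weights. The naive `conv2d` helper below is a hypothetical stand-in for any hardware-supported 2D convolution.

```python
import numpy as np

def conv2d(x, w):
    """Naive 'valid' 2D convolution: x has shape (N, H, W), w (kH, kW),
    with the same kernel shared across all N samples."""
    n, h, wd = x.shape
    kh, kw = w.shape
    out = np.zeros((n, h - kh + 1, wd - kw + 1))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            out[:, i, j] = (x[:, i:i + kh, j:j + kw] * w).sum(axis=(1, 2))
    return out

# Toy video tensor: batch B, T frames, H x W (single channel for brevity).
B, T, H, W = 2, 8, 16, 16
video = np.random.rand(B, T, H, W)

# Fold time into the batch dimension so a plain 2D conv sees each frame
# independently -- no 3D kernels or (2+1)D factorization involved.
frames = video.reshape(B * T, H, W)
feat = conv2d(frames, np.ones((3, 3)) / 9.0)

# Unfold back to (B, T, H', W') for any subsequent temporal aggregation.
feat = feat.reshape(B, T, H - 2, W - 2)
```

How temporal information is then mixed (e.g., by channel shifts or pooling across `T`) is a separate design choice that this sketch does not cover.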
