Paper Title
Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios
Paper Authors
Paper Abstract
Due to their complex attention mechanisms and model designs, most existing vision Transformers (ViTs) cannot perform as efficiently as convolutional neural networks (CNNs) in realistic industrial deployment scenarios, e.g., TensorRT and CoreML. This poses a distinct challenge: can a visual neural network be designed to infer as fast as CNNs and perform as powerfully as ViTs? Recent works have tried to design CNN-Transformer hybrid architectures to address this issue, yet their overall performance is far from satisfactory. To this end, we propose a next-generation vision Transformer for efficient deployment in realistic industrial scenarios, namely Next-ViT, which dominates both CNNs and ViTs from the perspective of the latency/accuracy trade-off. In this work, the Next Convolution Block (NCB) and the Next Transformer Block (NTB) are developed to capture local and global information, respectively, with deployment-friendly mechanisms. Then, the Next Hybrid Strategy (NHS) is designed to stack NCBs and NTBs in an efficient hybrid paradigm, which boosts performance in various downstream tasks. Extensive experiments show that Next-ViT significantly outperforms existing CNNs, ViTs, and CNN-Transformer hybrid architectures with respect to the latency/accuracy trade-off across various vision tasks. On TensorRT, Next-ViT surpasses ResNet by 5.5 mAP (from 40.4 to 45.9) on COCO detection and by 7.7% mIoU (from 38.8% to 46.5%) on ADE20K segmentation under similar latency. Meanwhile, it achieves performance comparable to CSWin, while its inference speed is accelerated by 3.6x. On CoreML, Next-ViT surpasses EfficientFormer by 4.6 mAP (from 42.6 to 47.2) on COCO detection and by 3.5% mIoU (from 45.1% to 48.6%) on ADE20K segmentation under similar latency. Our code and models are made public at: https://github.com/bytedance/Next-ViT
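The abstract only names the building blocks; their internal designs are given in the paper body. As a rough illustration of how a CNN-Transformer hybrid stage of this kind can be assembled, the PyTorch sketch below stacks a few convolution-based (NCB-like) blocks followed by one attention-based (NTB-like) block per stage. The block internals, the stacking ratio, and all names introduced here (LocalBlock, GlobalBlock, HybridStage, n_local) are assumptions for illustration only, not Next-ViT's actual implementation.

```python
# Illustrative sketch only: the abstract does not specify the internals of NCB,
# NTB, or NHS, so the block designs and the "n local blocks + 1 global block"
# stacking pattern below are assumptions, not the paper's implementation.
import torch
import torch.nn as nn


class LocalBlock(nn.Module):
    """Stand-in for an NCB-like block: convolution capturing local information."""
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),  # depthwise conv
            nn.BatchNorm2d(dim),
            nn.Conv2d(dim, dim, kernel_size=1),                          # pointwise conv
        )

    def forward(self, x):
        return x + self.conv(x)  # residual connection


class GlobalBlock(nn.Module):
    """Stand-in for an NTB-like block: self-attention capturing global information."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)   # (B, H*W, C) token sequence
        y = self.norm(tokens)
        y, _ = self.attn(y, y, y)               # global self-attention
        tokens = tokens + y                      # residual connection
        return tokens.transpose(1, 2).reshape(b, c, h, w)


class HybridStage(nn.Module):
    """Stand-in for one NHS-style stage: several local blocks, then one global block."""
    def __init__(self, dim, n_local=3):
        super().__init__()
        self.blocks = nn.Sequential(
            *[LocalBlock(dim) for _ in range(n_local)],
            GlobalBlock(dim),
        )

    def forward(self, x):
        return self.blocks(x)


if __name__ == "__main__":
    stage = HybridStage(dim=64)
    out = stage(torch.randn(1, 64, 56, 56))
    print(out.shape)  # torch.Size([1, 64, 56, 56])
```

In this sketch the attention block is applied once at the end of each stage, reflecting the abstract's idea of mixing cheap local (convolutional) computation with occasional global (attention) computation; the released repository at https://github.com/bytedance/Next-ViT contains the authors' actual block and stage definitions.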