Paper Title
SepViT: Separable Vision Transformer
Paper Authors
Paper Abstract
Vision Transformers have witnessed prevailing success in a series of vision tasks. However, these Transformers often rely on extensive computational costs to achieve high performance, which makes them burdensome to deploy on resource-constrained devices. To alleviate this issue, we draw lessons from depthwise separable convolution and imitate its design philosophy to build an efficient Transformer backbone, i.e., the Separable Vision Transformer, abbreviated as SepViT. SepViT carries out local-global information interaction within and among windows in sequential order via depthwise separable self-attention. A novel window token embedding and grouped self-attention are employed to compute the attention relationship among windows at negligible cost and to establish long-range visual interactions across multiple windows, respectively. Extensive experiments on general-purpose vision benchmarks demonstrate that SepViT achieves a state-of-the-art trade-off between performance and latency. Notably, SepViT attains 84.2% top-1 accuracy on ImageNet-1K classification while reducing latency by 40% compared to models of similar accuracy (e.g., CSWin). Furthermore, SepViT achieves 51.0% mIoU on the ADE20K semantic segmentation task, 47.9 AP on the RetinaNet-based COCO detection task, and 49.4 box AP and 44.6 mask AP on Mask R-CNN-based COCO object detection and instance segmentation tasks.
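To make the mechanism described above concrete, here is a minimal PyTorch sketch of a depthwise separable self-attention block, reconstructed from this abstract alone: self-attention is first restricted to each non-overlapping window (the depthwise step), a learnable window token summarizes each window, and a second attention over the window tokens exchanges information among windows (standing in for the grouped self-attention, whose exact grouping scheme the abstract does not specify). All module names, shapes, the single-head simplification, and the fusion rule are illustrative assumptions, not the paper's actual implementation.

# Minimal sketch of depthwise separable self-attention, inferred from the
# abstract alone. Names, shapes, and the single-head simplification are
# assumptions for illustration, not the paper's actual implementation.
import torch
import torch.nn as nn


class SeparableSelfAttentionSketch(nn.Module):
    def __init__(self, dim: int, window_size: int):
        super().__init__()
        self.ws = window_size
        self.scale = dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)                     # shared Q/K/V projection
        self.win_token = nn.Parameter(torch.zeros(1, 1, dim))  # window token embedding
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C); H and W are assumed divisible by the window size.
        B, H, W, C = x.shape
        ws = self.ws
        nw = (H // ws) * (W // ws)                  # number of windows

        # Partition into non-overlapping windows: (B*nw, ws*ws, C).
        x = x.view(B, H // ws, ws, W // ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B * nw, ws * ws, C)

        # Prepend a learnable window token to each window so the depthwise
        # step also produces a per-window summary.
        tok = self.win_token.expand(B * nw, 1, C)
        x = torch.cat([tok, x], dim=1)              # (B*nw, 1 + ws*ws, C)

        # Depthwise step: self-attention restricted to each window.
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        x = attn.softmax(dim=-1) @ v                # (B*nw, 1 + ws*ws, C)

        # Pointwise step: attention among the window tokens, giving
        # long-range interaction across windows (a global stand-in for the
        # paper's grouped self-attention).
        win_tok = x[:, 0].view(B, nw, C)            # (B, nw, C)
        qt, kt, vt = self.qkv(win_tok).chunk(3, dim=-1)
        t_attn = (qt @ kt.transpose(-2, -1)) * self.scale
        win_tok = t_attn.softmax(dim=-1) @ vt       # (B, nw, C)

        # Broadcast the globally mixed window tokens back to the pixels of
        # each window (a simple residual fusion; this rule is an assumption).
        pix = x[:, 1:] + win_tok.view(B * nw, 1, C)
        pix = self.proj(pix)

        # Reverse the window partition: back to (B, H, W, C).
        pix = pix.view(B, H // ws, W // ws, ws, ws, C)
        return pix.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)


if __name__ == "__main__":
    m = SeparableSelfAttentionSketch(dim=96, window_size=7)
    out = m(torch.randn(2, 56, 56, 96))
    print(out.shape)  # torch.Size([2, 56, 56, 96])

The analogy to depthwise separable convolution is direct: the within-window attention mirrors the depthwise convolution (spatial mixing within each group), while the attention over window tokens mirrors the pointwise convolution (cheap mixing across groups), which is why the cross-window cost stays negligible relative to full global attention.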