EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction

Paper Title

EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction

Authors

Han Cai, Junyan Li, Muyan Hu, Chuang Gan, Song Han

Abstract

High-resolution dense prediction enables many appealing real-world applications, such as computational photography and autonomous driving. However, the vast computational cost makes it difficult to deploy state-of-the-art high-resolution dense prediction models on hardware devices. This work presents EfficientViT, a new family of high-resolution vision models with novel multi-scale linear attention. Unlike prior high-resolution dense prediction models that rely on heavy softmax attention, hardware-inefficient large-kernel convolution, or complicated topology structures to obtain good performance, our multi-scale linear attention achieves a global receptive field and multi-scale learning (two desirable features for high-resolution dense prediction) with only lightweight and hardware-efficient operations. As such, EfficientViT delivers remarkable performance gains over previous state-of-the-art models with significant speedup on diverse hardware platforms, including mobile CPU, edge GPU, and cloud GPU. Without performance loss on Cityscapes, our EfficientViT provides up to 13.9x and 6.2x GPU latency reduction over SegFormer and SegNeXt, respectively. For super-resolution, EfficientViT delivers up to 6.4x speedup over Restormer while providing a 0.11 dB gain in PSNR. For Segment Anything, EfficientViT delivers 48.9x higher throughput on an A100 GPU while achieving slightly better zero-shot instance segmentation performance on COCO.
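
To illustrate why linear attention can be cheaper than softmax attention at high resolution, here is a minimal NumPy sketch. It assumes a ReLU feature map as the similarity kernel, which the abstract does not specify, and it covers only the linear-attention idea, not the multi-scale aggregation or anything else in the authors' architecture. The function names and the `eps` parameter are hypothetical. The point is that the product can be reordered so the cost grows linearly with the number of tokens `n` rather than quadratically.

```python
import numpy as np


def relu_linear_attention(q, k, v, eps=1e-6):
    """Linear attention with an assumed ReLU kernel.

    q, k: arrays of shape (n, d); v: array of shape (n, d_v).
    Cost is O(n * d * d_v) because no (n, n) matrix is ever formed.
    """
    q = np.maximum(q, 0.0)
    k = np.maximum(k, 0.0)
    kv = k.T @ v            # (d, d_v): keys and values summarized once
    k_sum = k.sum(axis=0)   # (d,): used for the per-query normalizer
    numerator = q @ kv      # (n, d_v)
    denominator = q @ k_sum + eps  # (n,)
    return numerator / denominator[:, None]


def relu_quadratic_attention(q, k, v, eps=1e-6):
    """Same computation written the direct O(n^2) way, for comparison only."""
    q = np.maximum(q, 0.0)
    k = np.maximum(k, 0.0)
    scores = q @ k.T        # (n, n): the matrix linear attention avoids
    return (scores @ v) / (scores.sum(axis=1, keepdims=True) + eps)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, d_v = 1024, 16, 16  # n plays the role of the number of pixels/tokens
    q = rng.standard_normal((n, d))
    k = rng.standard_normal((n, d))
    v = rng.standard_normal((n, d_v))
    fast = relu_linear_attention(q, k, v)
    slow = relu_quadratic_attention(q, k, v)
    print("outputs match:", np.allclose(fast, slow))
```

Both functions compute the same result; the first simply applies matrix-multiplication associativity, (QK^T)V = Q(K^T V), so every query still attends to every key (a global receptive field) without materializing the n-by-n score matrix.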
