Paper title
SSformer: A Lightweight Transformer for Semantic Segmentation
Paper authors
Abstract
Transformers are widely believed to outperform convolutional neural networks in semantic segmentation. Nevertheless, the original Vision Transformer may lack the inductive biases of local neighborhoods and incurs a high time complexity. Recently, Swin Transformer has set new records in various vision tasks by using a hierarchical architecture and shifted windows while being more efficient. However, as Swin Transformer is designed specifically for image classification, it may achieve suboptimal performance on dense-prediction segmentation tasks. Further, simply combining Swin Transformer with existing methods would increase the model size and parameter count of the final segmentation model. In this paper, we rethink Swin Transformer for semantic segmentation and design a lightweight yet effective transformer model called SSformer. In this model, considering the inherent hierarchical design of Swin Transformer, we propose a decoder that aggregates information from different layers, thus obtaining both local and global attention. Experimental results show that the proposed SSformer yields mIoU performance comparable to state-of-the-art models while maintaining a smaller model size and lower compute.
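The abstract describes a decoder that aggregates features from the different stages of Swin Transformer's hierarchical backbone. The following is a minimal NumPy sketch of that general idea — not the paper's actual decoder, whose details are not given here. All function names, the projection width, and the stage shapes (a Swin-style four-stage pyramid with strides 4/8/16/32 and doubling channels) are illustrative assumptions: each stage's features are linearly projected to a common channel width, upsampled to the finest resolution, concatenated, and fused.

```python
import numpy as np

def upsample_nearest(x, factor):
    # x: (H, W, C) feature map; nearest-neighbor upsampling by an integer factor
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def aggregate_stages(features, proj_dim=64, rng=None):
    """Hypothetical multi-stage aggregation sketch (not the paper's decoder):
    project each stage to a common channel width, upsample to the finest
    spatial resolution, then fuse by concatenation and a linear layer."""
    rng = np.random.default_rng(0) if rng is None else rng
    target_h = max(f.shape[0] for f in features)  # finest stage resolution
    projected = []
    for f in features:
        h, w, c = f.shape
        w_proj = rng.standard_normal((c, proj_dim)) * 0.02  # per-stage projection (random here)
        p = f @ w_proj                                      # (h, w, proj_dim)
        p = upsample_nearest(p, target_h // h)              # align spatial sizes
        projected.append(p)
    fused = np.concatenate(projected, axis=-1)              # (target_h, ·, 4*proj_dim)
    w_fuse = rng.standard_normal((fused.shape[-1], proj_dim)) * 0.02
    return fused @ w_fuse                                   # (target_h, ·, proj_dim)

# Swin-style feature pyramid: four stages with halving resolution, doubling channels
stages = [np.random.rand(32 // 2**i, 32 // 2**i, 96 * 2**i) for i in range(4)]
out = aggregate_stages(stages)
print(out.shape)  # (32, 32, 64)
```

Fusing all stages lets coarse (global) and fine (local) attention maps contribute to the final prediction, which is the property the abstract attributes to the proposed decoder.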