Multi-class Token Transformer for Weakly Supervised Semantic Segmentation

Paper title

Multi-class Token Transformer for Weakly Supervised Semantic Segmentation

Authors

Lian Xu, Wanli Ouyang, Mohammed Bennamoun, Farid Boussaid, Dan Xu

Abstract

This paper proposes a new transformer-based framework to learn class-specific object localization maps as pseudo labels for weakly supervised semantic segmentation (WSSS). Inspired by the fact that the attended regions of the one-class token in the standard vision transformer can be leveraged to form a class-agnostic localization map, we investigate whether the transformer model can also effectively capture class-specific attention for more discriminative object localization by learning multiple class tokens within the transformer. To this end, we propose a Multi-class Token Transformer, termed MCTformer, which uses multiple class tokens to learn interactions between the class tokens and the patch tokens. The proposed MCTformer can successfully produce class-discriminative object localization maps from the class-to-patch attentions corresponding to different class tokens. We also propose to use a patch-level pairwise affinity, which is extracted from the patch-to-patch transformer attention, to further refine the localization maps. Moreover, the proposed framework is shown to fully complement the Class Activation Mapping (CAM) method, leading to remarkably superior WSSS results on the PASCAL VOC and MS COCO datasets. These results underline the importance of the class token for WSSS.
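The two attention-based steps named in the abstract (reading class-specific maps from class-to-patch attention, then refining them with patch-to-patch affinity) can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes a single attention matrix that has already been averaged over heads and layers, with tokens ordered as class tokens followed by patch tokens. The function name `class_localization_maps`, its parameters, and the random attention used in the demo are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def class_localization_maps(attn, num_classes, grid_hw, refine=True):
    """Illustrative sketch (hypothetical helper, not the paper's code).

    attn:        (T, T) attention matrix, T = num_classes + H * W,
                 assumed token order: [class tokens, patch tokens].
    num_classes: number of class tokens C.
    grid_hw:     (H, W) patch grid size.
    Returns an array of shape (C, H, W) with each map scaled to [0, 1].
    """
    h, w = grid_hw
    n = h * w
    assert attn.shape == (num_classes + n, num_classes + n)

    # Class-to-patch attention: one row of patch scores per class token.
    cls2patch = attn[:num_classes, num_classes:]      # (C, N)
    # Patch-to-patch attention, used here as a pairwise affinity.
    patch2patch = attn[num_classes:, num_classes:]    # (N, N)

    maps = cls2patch
    if refine:
        # refined[c, i] = sum_j patch2patch[i, j] * maps[c, j]
        maps = maps @ patch2patch.T

    maps = maps.reshape(num_classes, h, w)
    # Min-max normalize each class map independently.
    lo = maps.min(axis=(1, 2), keepdims=True)
    hi = maps.max(axis=(1, 2), keepdims=True)
    return (maps - lo) / (hi - lo + 1e-8)


# Demo with random attention (placeholder data, not a trained model).
rng = np.random.default_rng(0)
C, H, W = 3, 4, 4
attn = softmax(rng.normal(size=(C + H * W, C + H * W)))
maps = class_localization_maps(attn, C, (H, W))
print(maps.shape)  # (3, 4, 4)
```

In this sketch the refinement is a single matrix product, so each patch's class score becomes an affinity-weighted average of the scores of the patches it attends to. With a trained model, the random `attn` would be replaced by attention extracted from the transformer.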

Paper title