Paper Title
CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation
Paper Authors
Paper Abstract
Weakly supervised semantic segmentation (WSSS) with image-level labels is a challenging task. Mainstream approaches follow a multi-stage framework and suffer from high training costs. In this paper, we explore the potential of Contrastive Language-Image Pre-training models (CLIP) to localize different categories with only image-level labels and without further training. To efficiently generate high-quality segmentation masks from CLIP, we propose a novel WSSS framework called CLIP-ES. Our framework improves all three stages of WSSS with special designs for CLIP: 1) We introduce the softmax function into GradCAM and exploit the zero-shot ability of CLIP to suppress the confusion caused by non-target classes and backgrounds. Meanwhile, to take full advantage of CLIP, we re-explore text inputs under the WSSS setting and customize two text-driven strategies: sharpness-based prompt selection and synonym fusion. 2) To simplify the stage of CAM refinement, we propose a real-time class-aware attention-based affinity (CAA) module based on the inherent multi-head self-attention (MHSA) in CLIP-ViTs. 3) When training the final segmentation model with the masks generated by CLIP, we introduce a confidence-guided loss (CGL) to focus on confident regions. Our CLIP-ES achieves SOTA performance on Pascal VOC 2012 and MS COCO 2014 while requiring only 10% of the time of previous methods for pseudo mask generation. Code is available at https://github.com/linyq2117/CLIP-ES.
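The confidence-guided loss (CGL) idea from point 3 can be illustrated with a minimal NumPy sketch: a pixel-wise cross-entropy that is averaged only over pixels whose pseudo-mask confidence passes a threshold, so uncertain regions of the CLIP-generated mask do not contribute to the gradient. This is a hedged illustration, not the paper's implementation; the function name, tensor shapes, and the 0.95 threshold are assumptions made for the example.

```python
import numpy as np

def confidence_guided_loss(logits, pseudo_labels, confidence, threshold=0.95):
    """Illustrative cross-entropy over confident pixels only.

    logits:        (H, W, C) raw class scores from the segmentation model
    pseudo_labels: (H, W) integer class indices from the generated pseudo mask
    confidence:    (H, W) per-pixel confidence of the pseudo mask in [0, 1]
    threshold:     pixels at or below this confidence are ignored
    """
    # Numerically stable softmax over the class dimension.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

    h, w = pseudo_labels.shape
    # Negative log-likelihood of the pseudo label at every pixel.
    nll = -np.log(
        probs[np.arange(h)[:, None], np.arange(w)[None, :], pseudo_labels] + 1e-12
    )

    mask = confidence > threshold  # keep only confident pixels
    if not mask.any():
        return 0.0
    return float(nll[mask].mean())
```

For example, pixels where the pseudo mask is uncertain (confidence below the threshold) are simply excluded from the average, which is the mechanism the abstract describes as focusing training on confident regions.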