Paper Title
CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation
Paper Authors
Paper Abstract
Weakly supervised semantic segmentation (WSSS) with image-level labels is a challenging task. Mainstream approaches follow a multi-stage framework and suffer from high training costs. In this paper, we explore the potential of Contrastive Language-Image Pre-training models (CLIP) to localize different categories with only image-level labels and without further training. To efficiently generate high-quality segmentation masks from CLIP, we propose a novel WSSS framework called CLIP-ES. Our framework improves all three stages of WSSS with special designs for CLIP: 1) We introduce the softmax function into GradCAM and exploit the zero-shot ability of CLIP to suppress the confusion caused by non-target classes and backgrounds. Meanwhile, to take full advantage of CLIP, we re-explore text inputs under the WSSS setting and customize two text-driven strategies: sharpness-based prompt selection and synonym fusion. 2) To simplify the stage of CAM refinement, we propose a real-time class-aware attention-based affinity (CAA) module based on the inherent multi-head self-attention (MHSA) in CLIP-ViTs. 3) When training the final segmentation model with the masks generated by CLIP, we introduce a confidence-guided loss (CGL) to focus on confident regions. Our CLIP-ES achieves SOTA performance on Pascal VOC 2012 and MS COCO 2014 while requiring only 10% of the time of previous methods for pseudo mask generation. Code is available at https://github.com/linyq2117/CLIP-ES.
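The confidence-guided loss (CGL) idea from point 3 can be illustrated with a minimal NumPy sketch: a pixel-wise cross-entropy that is averaged only over pixels whose pseudo-mask confidence passes a threshold, so uncertain regions of the CLIP-generated mask do not contribute to the gradient. This is a hedged illustration, not the paper's implementation; the function name, tensor shapes, and the 0.95 threshold are assumptions made for the example.

```python
import numpy as np

def confidence_guided_loss(logits, pseudo_labels, confidence, threshold=0.95):
    """Illustrative cross-entropy over confident pixels only.

    logits:        (H, W, C) raw class scores from the segmentation model
    pseudo_labels: (H, W) integer class indices from the generated pseudo mask
    confidence:    (H, W) per-pixel confidence of the pseudo mask in [0, 1]
    threshold:     pixels at or below this confidence are ignored
    """
    # Numerically stable softmax over the class dimension.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

    h, w = pseudo_labels.shape
    # Negative log-likelihood of the pseudo label at every pixel.
    nll = -np.log(
        probs[np.arange(h)[:, None], np.arange(w)[None, :], pseudo_labels] + 1e-12
    )

    mask = confidence > threshold  # keep only confident pixels
    if not mask.any():
        return 0.0
    return float(nll[mask].mean())
```

For example, pixels where the pseudo mask is uncertain (confidence below the threshold) are simply excluded from the average, which is the mechanism the abstract describes as focusing training on confident regions.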