GroupViT: Semantic Segmentation Emerges from Text Supervision

Paper title

GroupViT: Semantic Segmentation Emerges from Text Supervision

Authors

Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang

Abstract

Grouping and recognition are important components of visual scene understanding, e.g., for object detection and semantic segmentation. With end-to-end deep learning systems, grouping of image regions usually happens implicitly via top-down supervision from pixel-level recognition labels. Instead, in this paper, we propose to bring back the grouping mechanism into deep networks, which allows semantic segments to emerge automatically with only text supervision. We propose a hierarchical Grouping Vision Transformer (GroupViT), which goes beyond the regular grid structure representation and learns to group image regions into progressively larger arbitrary-shaped segments. We train GroupViT jointly with a text encoder on a large-scale image-text dataset via contrastive losses. With only text supervision and without any pixel-level annotations, GroupViT learns to group together semantic regions and successfully transfers to the task of semantic segmentation in a zero-shot manner, i.e., without any further fine-tuning. It achieves a zero-shot accuracy of 52.3% mIoU on the PASCAL VOC 2012 and 22.4% mIoU on PASCAL Context datasets, and performs competitively to state-of-the-art transfer-learning methods requiring greater levels of supervision. We open-source our code at https://github.com/NVlabs/GroupViT .
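
The abstract mentions joint training of the image and text encoders "via contrastive losses" without giving their form. As a rough illustration only, the sketch below shows a generic symmetric image-text contrastive (InfoNCE-style) loss over a batch of paired embeddings. The function name, the NumPy implementation, and the temperature value of 0.07 are assumptions made for this example and are not taken from the paper or its code.

```python
import numpy as np


def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Generic symmetric image-text contrastive loss (illustrative sketch).

    image_emb, text_emb: arrays of shape (batch, dim); row i of each is a
    matched image/text pair. The temperature default is an assumed value.
    """
    # L2-normalize so that dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix, shape (batch, batch).
    logits = image_emb @ text_emb.T / temperature

    def cross_entropy_on_diagonal(scores):
        # Numerically stable log-softmax over each row; the correct
        # "class" for row i is column i (the matched pair).
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the image-to-text and text-to-image directions.
    image_to_text = cross_entropy_on_diagonal(logits)
    text_to_image = cross_entropy_on_diagonal(logits.T)
    return 0.5 * (image_to_text + text_to_image)
```

With this formulation, a batch whose image and text embeddings are correctly aligned yields a lower loss than the same batch with the text rows shuffled, which is the signal that pulls matched pairs together during training.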
