Paper title
RankSeg: Adaptive Pixel Classification with Image Category Ranking for Segmentation
Paper authors
Paper abstract
The segmentation task has traditionally been formulated as a complete-label pixel classification task: predict a class for each pixel from a fixed number of predefined semantic categories shared by all images or videos. Yet under this formulation, standard architectures inevitably encounter challenges in more realistic settings where the number of categories scales up (e.g., beyond 1k). On the other hand, in a typical image or video, only a few categories, i.e., a small subset of the complete label set, are present. Motivated by this observation, we propose to decompose segmentation into two sub-problems: (i) image-level or video-level multi-label classification and (ii) pixel-level rank-adaptive selected-label classification. Given an input image or video, our framework first conducts multi-label classification over the complete label set, then sorts the labels and selects a small subset according to their class confidence scores. We then use a rank-adaptive pixel classifier to perform pixel-wise classification over only the selected labels; this classifier uses a set of rank-oriented learnable temperature parameters to adjust the pixel classification scores. Our approach is conceptually general and can improve various existing segmentation frameworks by simply adding a lightweight multi-label classification head and a rank-adaptive pixel classifier. We demonstrate the effectiveness of our framework with competitive experimental results across four tasks: image semantic segmentation, image panoptic segmentation, video instance segmentation, and video semantic segmentation. In particular, with RankSeg, Mask2Former gains +0.8%/+0.7%/+0.7% on the ADE20K panoptic segmentation/YouTubeVIS 2019 video instance segmentation/VSPW video semantic segmentation benchmarks, respectively.
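The two-stage decomposition described in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the function names `select_top_k` and `rank_adaptive_classify` are hypothetical, the temperatures are fixed numbers rather than learned parameters, and the way a temperature enters the score (division of the pixel score by the temperature of its category's rank) is an assumption made only for illustration.

```python
def select_top_k(image_logits, k):
    """Return the indices of the k most confident categories, best first."""
    order = sorted(range(len(image_logits)),
                   key=lambda c: image_logits[c], reverse=True)
    return order[:k]


def rank_adaptive_classify(pixel_logits, image_logits, temperatures):
    """Classify each pixel over only the selected categories.

    pixel_logits: one list of scores per pixel, covering the full label set.
    image_logits: image-level multi-label scores, one per category.
    temperatures: one value per rank position (learned in the paper,
                  fixed here; dividing by them is an assumption).
    """
    selected = select_top_k(image_logits, len(temperatures))
    labels = []
    for scores in pixel_logits:
        # Rescale the score of the category at rank r by that rank's temperature.
        scaled = [scores[c] / temperatures[r] for r, c in enumerate(selected)]
        best_rank = max(range(len(scaled)), key=scaled.__getitem__)
        labels.append(selected[best_rank])
    return selected, labels


# Toy input: 5 categories, 2 pixels, keep the top 3 categories.
image_logits = [0.1, 2.0, -1.0, 1.5, 0.3]
pixel_logits = [
    [5.0, 1.0, 0.0, 1.5, 0.2],
    [0.0, 0.2, 9.0, 1.0, 3.0],
]
temperatures = [0.5, 1.0, 2.0]

selected, labels = rank_adaptive_classify(pixel_logits, image_logits, temperatures)
print(selected, labels)
```

In this toy input, categories 0 and 2 have the highest raw pixel scores but are never predicted, because the image-level ranking excludes them before pixel classification. The first pixel also shows the effect of the rank-dependent rescaling: the top-ranked category wins even though the second-ranked category has a higher raw score.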