Transformer-CNN Cohort: Semi-supervised Semantic Segmentation by the Best of Both Students

Paper Title

Transformer-CNN Cohort: Semi-supervised Semantic Segmentation by the Best of Both Students

Authors

Xu Zheng, Yunhao Luo, Chong Fu, Kangcheng Liu, Lin Wang

Abstract

Popular methods for semi-supervised semantic segmentation mostly adopt a unitary network model based on convolutional neural networks (CNNs) and enforce consistency of the model's predictions under perturbations applied to the inputs or to the model. However, this learning paradigm has two critical limitations: (a) learning discriminative features for the unlabeled data and (b) learning both global and local information from the whole image. In this paper, we propose a novel semi-supervised learning (SSL) approach, called Transformer-CNN Cohort (TCC), that consists of two students, one based on the vision transformer (ViT) and the other based on a CNN. Our method incorporates multi-level consistency regularization on the predictions and on the heterogeneous feature spaces via pseudo-labeling of the unlabeled data. First, because the inputs of the ViT student are image patches, the extracted feature maps encode crucial class-wise statistics. To this end, we propose class-aware feature consistency distillation (CFCD), which first uses the outputs of each student as pseudo labels and then generates class-aware feature (CF) maps for knowledge transfer between the two students. Second, because the ViT student has more uniform representations across all layers, we propose consistency-aware cross distillation (CCD) to transfer knowledge between the pixel-wise predictions of the cohort. We validate the TCC framework on the Cityscapes and Pascal VOC 2012 datasets, where it outperforms existing SSL methods by a large margin.
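The abstract does not give the exact form of CFCD or CCD, so the following is only a minimal sketch of the two ideas under stated assumptions. It assumes that a class-aware feature is the mean of the per-pixel feature vectors sharing a pseudo label, that feature consistency is a mean squared difference over the classes both students predict, and that cross distillation is a cross-entropy against the other student's argmax prediction. The function names, the flattened list-based data layout, and the requirement that both students' features have the same dimension are all hypothetical and not taken from the paper.

```python
import math


def class_aware_features(features, pseudo_labels, num_classes):
    """Average the per-pixel feature vectors that share a pseudo label.

    features: list of per-pixel feature vectors (pixels flattened).
    pseudo_labels: list of int class ids, one per pixel.
    Returns {class_id: mean feature vector} for the classes that occur.
    """
    dim = len(features[0])
    sums, counts = {}, {}
    for vec, label in zip(features, pseudo_labels):
        if not 0 <= label < num_classes:
            continue  # skip labels outside the valid class range
        acc = sums.setdefault(label, [0.0] * dim)
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def feature_consistency(cf_a, cf_b):
    """Mean squared difference over classes present in both CF maps.

    Assumption: both students' features were already projected to the
    same dimension; the paper's actual distance may differ.
    """
    shared = sorted(set(cf_a) & set(cf_b))
    if not shared:
        return 0.0
    total = 0.0
    for c in shared:
        sq = sum((x - y) ** 2 for x, y in zip(cf_a[c], cf_b[c]))
        total += sq / len(cf_a[c])
    return total / len(shared)


def cross_pseudo_label_loss(probs, other_probs):
    """Cross-entropy of one student's per-pixel class probabilities
    against the other student's argmax (hard pseudo label)."""
    eps = 1e-12  # avoids log(0)
    loss = 0.0
    for p, q in zip(probs, other_probs):
        target = max(range(len(q)), key=q.__getitem__)
        loss -= math.log(p[target] + eps)
    return loss / len(probs)
```

In a two-student setup, each student would compute `class_aware_features` from its own features and pseudo labels, and `cross_pseudo_label_loss` would be applied in both directions so that the ViT student supervises the CNN student and vice versa. How the terms are weighted is not stated in the abstract.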
