Semantic Segmentation by Early Region Proxy

Paper title

Semantic Segmentation by Early Region Proxy

Authors

Yifan Zhang, Bo Pang, Cewu Lu

Abstract

Typical vision backbones manipulate structured features. As a compromise, semantic segmentation has long been modeled as per-point prediction on dense regular grids. In this work, we present a novel and efficient modeling that starts from interpreting the image as a tessellation of learnable regions, each of which has flexible geometry and carries homogeneous semantics. To model region-wise context, we use a Transformer to encode regions in a sequence-to-sequence manner by applying multi-layer self-attention to the region embeddings, which serve as proxies for specific regions. Semantic segmentation is then carried out as per-region prediction on top of the encoded region embeddings using a single linear classifier, so a decoder is no longer needed. The proposed RegProxy model discards the common Cartesian feature layout and operates purely at the region level. Hence, it exhibits the most competitive performance-efficiency trade-off compared with conventional dense prediction methods. For example, on ADE20K, the small-sized RegProxy-S/16 outperforms the best CNN model with 25% of the parameters and 4% of the computation, while the largest RegProxy-L/16 achieves 52.9 mIoU, which outperforms the state-of-the-art by 2.1% with fewer resources. Code and models are available at https://github.com/YiF-Zhang/RegionProxy.
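
The per-region prediction step described in the abstract can be illustrated with a minimal sketch. This is not the RegProxy implementation: it assumes the region embeddings have already been encoded, it assumes a hard pixel-to-region map (the abstract does not say how pixels are assigned to regions), and every name, shape, and value below is hypothetical.

```python
# Toy sketch of per-region prediction with a single linear classifier.
# All names and numbers are illustrative assumptions, not taken from the paper.

def linear_classifier(embedding, weights, bias):
    """Return one logit per class for a single region embedding."""
    return [
        sum(w * x for w, x in zip(class_weights, embedding)) + b
        for class_weights, b in zip(weights, bias)
    ]

def segment_by_region(region_embeddings, region_map, weights, bias):
    """Classify each region once, then paint that label onto its pixels."""
    region_labels = []
    for embedding in region_embeddings:
        logits = linear_classifier(embedding, weights, bias)
        # Index of the largest logit is the predicted class for this region.
        region_labels.append(max(range(len(logits)), key=logits.__getitem__))
    # region_map[y][x] holds the index of the region that owns pixel (x, y).
    return [[region_labels[r] for r in row] for row in region_map]

# Three regions with 2-D embeddings (assumed to be already encoded).
region_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Two classes; `weights` has one row per class.
weights = [[1.0, -1.0], [-1.0, 1.0]]
bias = [0.0, 0.5]
# A 2x4 image whose regions are not rectangular grid cells.
region_map = [
    [0, 0, 1, 1],
    [0, 2, 2, 1],
]
segmentation = segment_by_region(region_embeddings, region_map, weights, bias)
```

In this sketch the classifier runs once per region rather than once per pixel, so its cost scales with the number of regions; the dense label map is obtained only by a lookup at the end.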
