Paper Title
HCNet: Hierarchical Context Network for Semantic Segmentation
Paper Authors
Paper Abstract
Global context information is vital in visual understanding problems, especially in pixel-level semantic segmentation. Mainstream methods adopt the self-attention mechanism to model global context information. However, pixels belonging to different classes usually have weak feature correlation, so indiscriminately modeling the global pixel-level correlation matrix in self-attention is highly redundant. To address this problem, we propose a hierarchical context network that differentially models homogeneous pixels with strong correlations and heterogeneous pixels with weak correlations. Specifically, we first propose a multi-scale guided pre-segmentation module to divide the entire feature map into class-based homogeneous regions. Within each homogeneous region, we design a pixel context module to capture pixel-level correlations. Subsequently, unlike the self-attention mechanism, which models even weak heterogeneous correlations in a dense pixel-level manner, we propose a region context module that models sparse region-level dependencies using a unified representation of each region. By aggregating fine-grained pixel context features and coarse-grained region context features, the proposed network not only hierarchically models global context information but also harvests multi-granularity representations to identify multi-scale objects more robustly. We evaluate our approach on the Cityscapes and ISPRS Vaihingen datasets. Without bells or whistles, our approach achieves state-of-the-art results: a mean IoU of 82.8% on the Cityscapes test set and an overall accuracy of 91.4% on the ISPRS Vaihingen test set.
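To make the described hierarchy concrete, the following is a minimal PyTorch sketch, not the authors' released code, of the three-step idea in the abstract: a pre-segmentation head soft-assigns pixels to class-based regions, pixel-level attention is restricted to pixels that share a region, and region-level attention runs over pooled region descriptors. Every name here (`HierarchicalContext`, `num_regions`, the soft-masking scheme) is an illustrative assumption; in particular, this sketch still materializes the dense attention matrix and merely masks it, whereas a real implementation would exploit the region structure to avoid that redundancy.

```python
# Minimal sketch of a hierarchical context block (illustrative, not the
# authors' implementation). Assumes PyTorch >= 1.9 for batch_first attention.
import torch
import torch.nn as nn


class HierarchicalContext(nn.Module):
    """Toy hierarchical context block.

    1. A 1x1 conv "pre-segmentation" head soft-assigns each pixel to one of
       `num_regions` homogeneous regions (a stand-in for the paper's
       multi-scale guided pre-segmentation module).
    2. Pixel context: attention among pixels, softly masked so that pairs in
       different regions are suppressed.
    3. Region context: sparse attention over one pooled descriptor per region.
    4. Pixel-level and region-level context are fused with the input.
    """

    def __init__(self, channels: int, num_regions: int = 8):
        super().__init__()
        self.pre_seg = nn.Conv2d(channels, num_regions, kernel_size=1)
        self.query = nn.Conv2d(channels, channels, kernel_size=1)
        self.key = nn.Conv2d(channels, channels, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.region_attn = nn.MultiheadAttention(channels, num_heads=1,
                                                 batch_first=True)
        self.fuse = nn.Conv2d(channels * 3, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        n = h * w
        # Pre-segmentation: per-pixel soft region assignment, shape (b, r, n).
        assign = self.pre_seg(x).flatten(2).softmax(dim=1)

        q = self.query(x).flatten(2).transpose(1, 2)   # (b, n, c)
        k = self.key(x).flatten(2).transpose(1, 2)     # (b, n, c)
        v = self.value(x).flatten(2).transpose(1, 2)   # (b, n, c)

        # Pixel context: probability two pixels share a region, used as a
        # soft log-mask on the attention logits.
        same_region = torch.einsum('brn,brm->bnm', assign, assign)  # (b, n, n)
        logits = torch.bmm(q, k.transpose(1, 2)) / c ** 0.5
        logits = logits + (same_region + 1e-6).log()
        pixel_ctx = torch.bmm(logits.softmax(dim=-1), v)            # (b, n, c)

        # Region context: mean descriptor per region, then region-level
        # attention over the r descriptors (sparse: r << n).
        region_desc = torch.einsum('brn,bnc->brc', assign, v)
        region_desc = region_desc / (assign.sum(dim=2, keepdim=True) + 1e-6)
        region_ctx, _ = self.region_attn(region_desc, region_desc, region_desc)
        # Broadcast region-level context back to pixels via the assignments.
        region_ctx = torch.einsum('brn,brc->bnc', assign, region_ctx)

        pixel_ctx = pixel_ctx.transpose(1, 2).reshape(b, c, h, w)
        region_ctx = region_ctx.transpose(1, 2).reshape(b, c, h, w)
        return self.fuse(torch.cat([x, pixel_ctx, region_ctx], dim=1))
```

As a quick sanity check, `HierarchicalContext(64)(torch.randn(2, 64, 32, 32))` returns a tensor of the same `(2, 64, 32, 32)` shape, so the block can be dropped between backbone stages of a segmentation head under these assumptions.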