Paper Title
Cost Aggregation with 4D Convolutional Swin Transformer for Few-Shot Segmentation
Paper Authors
Paper Abstract
This paper presents a novel cost aggregation network, called Volumetric Aggregation with Transformers (VAT), for few-shot segmentation. The use of transformers can benefit correlation map aggregation through self-attention over a global receptive field. However, the tokenization of a correlation map for transformer processing can be detrimental, because the discontinuity at token boundaries reduces the local context available near the token edges and decreases inductive bias. To address this problem, we propose a 4D Convolutional Swin Transformer, where a high-dimensional Swin Transformer is preceded by a series of small-kernel convolutions that impart local context to all pixels and introduce convolutional inductive bias. We additionally boost aggregation performance by applying transformers within a pyramidal structure, where aggregation at a coarser level guides aggregation at a finer level. Noise in the transformer output is then filtered in the subsequent decoder with the help of the query's appearance embedding. With this model, a new state-of-the-art is set for all the standard benchmarks in few-shot segmentation. It is shown that VAT attains state-of-the-art performance for semantic correspondence as well, where cost aggregation also plays a central role.
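To make the aggregation idea concrete, below is a minimal PyTorch sketch of one aggregation block in the spirit the abstract describes: a 4D correlation volume is first passed through small-kernel convolutions (approximated here by two separable 2D convolutions, one over the support axes and one over the query axes) before Swin-style windowed self-attention. This is an illustrative sketch, not the paper's implementation; the module name CorrAggregationBlock and all hyperparameters are hypothetical, and shifted windows, relative position bias, the pyramidal levels, and the appearance-embedding decoder are omitted.

```python
# Hypothetical sketch of a 4D-conv-then-windowed-attention block.
# The "4D convolution" is approximated by two separable 2D convolutions,
# which is an assumption for brevity, not the paper's exact operator.
import torch
import torch.nn as nn

class CorrAggregationBlock(nn.Module):
    def __init__(self, embed_dim: int = 16, win: int = 4, heads: int = 4):
        super().__init__()
        self.win = win
        # Small-kernel convolutions that spread local context across
        # token boundaries before attention is applied.
        self.proj_in = nn.Conv2d(1, embed_dim, 3, padding=1)
        self.conv_s = nn.Conv2d(embed_dim, embed_dim, 3, padding=1)
        self.conv_q = nn.Conv2d(embed_dim, embed_dim, 3, padding=1)
        self.attn = nn.MultiheadAttention(embed_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, corr: torch.Tensor) -> torch.Tensor:
        # corr: (B, Hq, Wq, Hs, Ws) raw 4D correlation volume.
        B, Hq, Wq, Hs, Ws = corr.shape
        # Embed and convolve over the support axes (query dims folded into batch).
        x = corr.reshape(B * Hq * Wq, 1, Hs, Ws)
        x = self.conv_s(self.proj_in(x))                 # (B*Hq*Wq, C, Hs, Ws)
        C = x.shape[1]
        # Convolve over the query axes (support dims folded into batch).
        x = x.reshape(B, Hq, Wq, C, Hs, Ws).permute(0, 4, 5, 3, 1, 2)
        x = x.reshape(B * Hs * Ws, C, Hq, Wq)
        x = self.conv_q(x)
        # Swin-style self-attention within non-overlapping local windows
        # over query positions (Hq and Wq must be divisible by win).
        w = self.win
        x = x.reshape(B * Hs * Ws, C, Hq // w, w, Wq // w, w)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, w * w, C)
        x = x + self.attn(self.norm(x), self.norm(x), self.norm(x))[0]
        # Restore a (B, Hq, Wq, Hs, Ws, C) layout for later stages.
        x = x.reshape(B * Hs * Ws, Hq // w, Wq // w, w, w, C)
        x = x.permute(0, 5, 1, 3, 2, 4).reshape(B, Hs, Ws, C, Hq, Wq)
        return x.permute(0, 4, 5, 1, 2, 3)


# Toy usage: a random 4D correlation volume between 16x16 query and
# 16x16 support feature maps.
block = CorrAggregationBlock()
corr = torch.randn(2, 16, 16, 16, 16)
out = block(corr)
print(out.shape)  # torch.Size([2, 16, 16, 16, 16, 16])
```

The ordering here mirrors the abstract's key claim: the convolutions run before attention so that every correlation entry, including those at window edges, carries local context and convolutional inductive bias into the self-attention step.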