Paper Title

Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation

Paper Authors

Zihan Ding, Tianrui Hui, Junshi Huang, Xiaoming Wei, Jizhong Han, Si Liu

Paper Abstract

Referring video object segmentation aims to predict foreground labels for objects referred to by natural language expressions in videos. Previous methods either depend on 3D ConvNets or incorporate additional 2D ConvNets as encoders to extract mixed spatial-temporal features. However, these methods suffer from spatial misalignment or false distractors due to delayed and implicit spatial-temporal interaction occurring in the decoding phase. To tackle these limitations, we propose a Language-Bridged Duplex Transfer (LBDT) module, which utilizes language as an intermediary bridge to accomplish explicit and adaptive spatial-temporal interaction earlier, in the encoding phase. Concretely, cross-modal attention is performed among the temporal encoder, referring words, and the spatial encoder to aggregate and transfer language-relevant motion and appearance information. In addition, we propose a Bilateral Channel Activation (BCA) module in the decoding phase for further denoising and highlighting spatial-temporally consistent features via channel-wise activation. Extensive experiments show our method achieves new state-of-the-art performance on four popular benchmarks, with 6.8% and 6.9% absolute AP gains on A2D Sentences and J-HMDB Sentences, respectively, while consuming around 7x less computational overhead.
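The abstract names two mechanisms: language-bridged cross-modal attention between the spatial and temporal encoders, and channel-wise activation in the decoder. The PyTorch sketch below illustrates one direction of each idea only; all module names, tensor shapes, the pooling choices, and the single-head attention design are assumptions made for illustration, not the paper's actual implementation (which, per the abstract, performs a duplex transfer in both directions).

```python
# Minimal, hypothetical sketch of the two ideas named in the abstract.
# Shapes and module designs are assumptions, not the paper's code.
import torch
import torch.nn as nn

class LanguageBridgedTransfer(nn.Module):
    """One direction of the duplex transfer: words attend to temporal
    features to collect language-relevant motion cues, which are then
    passed to the spatial branch through the language bridge."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # queries from referring words
        self.k = nn.Linear(dim, dim)   # keys from temporal features
        self.v = nn.Linear(dim, dim)   # values from temporal features
        self.out = nn.Linear(dim, dim)

    def forward(self, words, temporal, spatial):
        # words: (B, L, C); temporal/spatial: (B, HW, C) flattened maps
        attn = torch.softmax(
            self.q(words) @ self.k(temporal).transpose(1, 2)
            / temporal.size(-1) ** 0.5, dim=-1)        # (B, L, HW)
        motion = attn @ self.v(temporal)               # (B, L, C)
        # pool the word-indexed motion cues and inject them spatially
        bridge = motion.mean(dim=1, keepdim=True)      # (B, 1, C)
        return spatial + self.out(bridge)

class BilateralChannelActivation(nn.Module):
    """Re-weights the channels of one branch with gates derived from
    the other, highlighting spatial-temporally consistent channels."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(dim, dim)

    def forward(self, spatial, temporal):
        stats = temporal.mean(dim=1)                   # (B, C) pooled stats
        gates = torch.sigmoid(self.gate(stats))        # (B, C) channel gates
        return spatial * gates.unsqueeze(1)            # gated spatial features

# Example usage with hypothetical sizes.
B, L, HW, C = 2, 8, 196, 256
words = torch.randn(B, L, C)
temporal = torch.randn(B, HW, C)
spatial = torch.randn(B, HW, C)
spatial = LanguageBridgedTransfer(C)(words, temporal, spatial)
spatial = BilateralChannelActivation(C)(spatial, temporal)
```

Routing the interaction through the compact word features, rather than attending between the two dense feature maps directly, is what keeps the transfer language-adaptive and cheap; the full method applies the symmetric transfer from the spatial to the temporal branch as well.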
