Paper Title

Depth-Cooperated Trimodal Network for Video Salient Object Detection

Paper Authors

Yukang Lu, Dingyao Min, Keren Fu, Qijun Zhao

Paper Abstract

Depth can provide useful geographical cues for salient object detection (SOD), and has been proven helpful in recent RGB-D SOD methods. However, existing video salient object detection (VSOD) methods only utilize spatiotemporal information and seldom exploit depth information for detection. In this paper, we propose a depth-cooperated trimodal network, called DCTNet for VSOD, which is a pioneering work to incorporate depth information to assist VSOD. To this end, we first generate depth from RGB frames, and then propose an approach to treat the three modalities unequally. Specifically, a multi-modal attention module (MAM) is designed to model multi-modal long-range dependencies between the main modality (RGB) and the two auxiliary modalities (depth, optical flow). We also introduce a refinement fusion module (RFM) to suppress noises in each modality and select useful information dynamically for further feature refinement. Lastly, a progressive fusion strategy is adopted after the refined features to achieve final cross-modal fusion. Experiments on five benchmark datasets demonstrate the superiority of our depth-cooperated model against 12 state-of-the-art methods, and the necessity of depth is also validated.
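The abstract describes a multi-modal attention module (MAM) that models long-range dependencies between the main modality (RGB) and the two auxiliary modalities (depth, optical flow). As an illustration only, the following is a minimal PyTorch sketch of one way such cross-modal long-range attention can be wired up; the class name, layer sizes, and the residual fusion below are assumptions made for this sketch and are not taken from the authors' implementation.

# Hypothetical sketch of cross-modal long-range attention, loosely following the
# abstract's description of the MAM; module name and details are assumptions,
# not the authors' code.
import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    """Lets main (RGB) features attend to one auxiliary modality (depth or flow)."""

    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, kernel_size=1)  # from RGB
        self.key = nn.Conv2d(channels, channels // 8, kernel_size=1)    # from auxiliary
        self.value = nn.Conv2d(channels, channels, kernel_size=1)       # from auxiliary
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable residual weight

    def forward(self, rgb: torch.Tensor, aux: torch.Tensor) -> torch.Tensor:
        b, c, h, w = rgb.shape
        q = self.query(rgb).flatten(2).transpose(1, 2)      # (B, HW, C/8)
        k = self.key(aux).flatten(2)                         # (B, C/8, HW)
        v = self.value(aux).flatten(2)                       # (B, C, HW)
        attn = torch.softmax(q @ k, dim=-1)                  # (B, HW, HW) long-range affinities
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)    # aggregate auxiliary cues per RGB position
        return rgb + self.gamma * out                        # residual injection into RGB features


if __name__ == "__main__":
    # Toy usage: RGB features are enhanced by depth features, then by flow features.
    mam_depth, mam_flow = CrossModalAttention(64), CrossModalAttention(64)
    rgb = torch.randn(1, 64, 32, 32)
    depth = torch.randn(1, 64, 32, 32)
    flow = torch.randn(1, 64, 32, 32)
    fused = mam_flow(mam_depth(rgb, depth), flow)
    print(fused.shape)  # torch.Size([1, 64, 32, 32])

Treating the two auxiliary modalities only as sources of keys and values (while queries come from RGB) is one simple way to keep RGB as the dominant stream, which matches the abstract's stated goal of handling the three modalities unequally.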
