Paper Title

Self-Supervised Video Representation Learning with Motion-Contrastive Perception

Paper Authors

Jinyu Liu, Ying Cheng, Yuejie Zhang, Rui-Wei Zhao, Rui Feng

Paper Abstract

Visual-only self-supervised learning has achieved significant improvements in video representation learning. Existing methods encourage models to learn video representations through contrastive learning or by designing specific pretext tasks. However, some models are likely to focus on the background, which is unimportant for learning video representations. To alleviate this problem, we propose a new view, called the long-range residual frame, to obtain more motion-specific information. Based on this, we propose the Motion-Contrastive Perception Network (MCPNet), which consists of two branches, namely Motion Information Perception (MIP) and Contrastive Instance Perception (CIP), and learns generic video representations by focusing on the changing areas in videos. Specifically, the MIP branch aims to learn fine-grained motion features, while the CIP branch performs contrastive learning to learn overall semantic information for each instance. Experiments on two benchmark datasets, UCF-101 and HMDB-51, show that our method outperforms current state-of-the-art visual-only self-supervised approaches.
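
The abstract does not spell out how the long-range residual frame is computed, nor the exact form of the CIP branch's contrastive objective. Below is a minimal PyTorch sketch, assuming the long-range residual frame is the pixel-wise difference between frames separated by a large temporal stride, and using a standard InfoNCE loss as a stand-in for the contrastive term. The function names (`long_range_residual`, `info_nce`), the `stride` value, and the temperature are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def long_range_residual(frames: torch.Tensor, stride: int = 4) -> torch.Tensor:
    """Pixel-wise difference between frames `stride` steps apart.

    frames: a clip of shape (T, C, H, W). Returns (T - stride, C, H, W).
    A large stride suppresses the static background, so the residual
    view is dominated by the changing (moving) regions of the video.
    NOTE: the stride of 4 is an assumed, illustrative value.
    """
    return frames[stride:] - frames[:-stride]

def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Standard InfoNCE loss over a batch of paired clip embeddings (B, D).

    z1[i] and z2[i] embed two views of the same video (positives);
    every other pairing in the batch serves as a negative.
    """
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                    # (B, B) cosine similarities
    labels = torch.arange(z1.size(0), device=z1.device)   # positives on the diagonal
    return F.cross_entropy(logits, labels)
```

Under these assumptions, a 16-frame clip with `stride=4` yields 12 residual frames that could be fed to a motion branch alongside the RGB clip, while `info_nce` compares instance-level embeddings of two augmented views of the same video.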
