Paper Title
Video Representation Learning by Recognizing Temporal Transformations
Paper Authors
Paper Abstract
We introduce a novel self-supervised learning approach to learn representations of videos that are responsive to changes in the motion dynamics. Our representations can be learned from data without human annotation and provide a substantial boost to the training of neural networks on small labeled data sets for tasks such as action recognition, which require accurately distinguishing the motion of objects. We promote accurate learning of motion without human annotation by training a neural network to discriminate a video sequence from its temporally transformed versions. To learn to distinguish non-trivial motions, the design of the transformations is based on two principles: 1) to define clusters of motions based on time warps of different magnitude; 2) to ensure that the discrimination is feasible only by observing and analyzing as many image frames as possible. Thus, we introduce the following transformations: forward-backward playback, random frame skipping, and uniform frame skipping. Our experiments show that networks trained with the proposed method yield representations with improved transfer performance for action recognition on UCF101 and HMDB51.
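To make the three temporal transformations named in the abstract concrete, here is a minimal sketch of how they could be applied to a clip, assuming frames are stored as a NumPy array of shape (T, H, W) or (T, H, W, C). The function names, parameters, and clip shapes are illustrative assumptions for this sketch, not the authors' implementation.

```python
import numpy as np

def uniform_skip(frames: np.ndarray, skip: int = 2) -> np.ndarray:
    """Uniform frame skipping: keep every `skip`-th frame (a constant speed-up)."""
    return frames[::skip]

def random_skip(frames: np.ndarray, out_len: int, rng: np.random.Generator) -> np.ndarray:
    """Random frame skipping: keep `out_len` frames at sorted random time indices
    (a non-uniform time warp)."""
    idx = np.sort(rng.choice(len(frames), size=out_len, replace=False))
    return frames[idx]

def forward_backward(frames: np.ndarray) -> np.ndarray:
    """Forward-backward playback: play the clip forward, then in reverse."""
    return np.concatenate([frames, frames[::-1]], axis=0)

# Toy usage: a "video" of 16 frames, each a 2x2 grayscale image.
rng = np.random.default_rng(0)
video = rng.random((16, 2, 2))
print(uniform_skip(video, 2).shape)      # (8, 2, 2)
print(random_skip(video, 8, rng).shape)  # (8, 2, 2)
print(forward_backward(video).shape)     # (32, 2, 2)
```

In the self-supervised setup the abstract describes, clips produced by these transformations (plus untransformed clips) would serve as inputs to a network trained to classify which transformation was applied; the classification head and training loop are omitted here.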