Paper title
Self-supervised video pretraining yields robust and more human-aligned visual representations
Paper authors
Paper abstract
Humans learn powerful representations of objects and scenes by observing how they evolve over time. Yet, outside of specific tasks that require explicit temporal understanding, static image pretraining remains the dominant paradigm for learning visual foundation models. We question this mismatch, and ask whether video pretraining can yield visual representations that bear the hallmarks of human perception: generalisation across tasks, robustness to perturbations, and consistency with human judgements. To that end we propose a novel procedure for curating videos, and develop a contrastive framework which learns from the complex transformations therein. This simple paradigm for distilling knowledge from videos, called VITO, yields general representations that far outperform prior video pretraining methods on image understanding tasks, and image pretraining methods on video understanding tasks. Moreover, VITO representations are significantly more robust to natural and synthetic deformations than image-, video-, and adversarially-trained ones. Finally, VITO's predictions are strongly aligned with human judgements, surpassing models that were specifically trained for that purpose. Together, these results suggest that video pretraining could be a simple way of learning unified, robust, and human-aligned representations of the visual world.
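The abstract describes a contrastive framework that treats the natural transformations between frames of the same video as learning signal. The sketch below illustrates that general idea with a standard InfoNCE objective over pairs of temporally separated frames; it is a minimal illustration, not VITO's actual training code, and the function name `video_infonce_loss`, the `encoder` interface, and the `temperature` value are placeholder assumptions.

```python
# Minimal sketch of a frame-pair contrastive (InfoNCE) objective of the kind
# the abstract describes. Assumptions (not from the paper): the encoder maps
# images to flat feature vectors, and the temperature is a generic default.
import torch
import torch.nn.functional as F

def video_infonce_loss(frame_a: torch.Tensor,
                       frame_b: torch.Tensor,
                       encoder: torch.nn.Module,
                       temperature: float = 0.1) -> torch.Tensor:
    """frame_a, frame_b: two temporally separated frames from the same
    videos, shape (batch, C, H, W). Frames at matching batch indices are
    positives; all other pairs in the batch act as negatives."""
    z_a = F.normalize(encoder(frame_a), dim=-1)   # (batch, dim)
    z_b = F.normalize(encoder(frame_b), dim=-1)   # (batch, dim)
    logits = z_a @ z_b.t() / temperature          # pairwise similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)
```

Under this kind of objective, the "augmentations" are not synthetic crops or color jitter alone but the real deformations, viewpoint changes, and motion that occur between frames, which is consistent with the abstract's claim that learning from such transformations yields more robust, human-aligned representations.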