Simple Unsupervised Object-Centric Learning for Complex and Naturalistic Videos

Paper Title

Simple Unsupervised Object-Centric Learning for Complex and Naturalistic Videos

Authors

Gautam Singh, Yi-Fu Wu, Sungjin Ahn

Abstract

Unsupervised object-centric learning aims to represent the modular, compositional, and causal structure of a scene as a set of object representations, and thereby promises to resolve many critical limitations of traditional single-vector representations, such as poor systematic generalization. Although there have been many remarkable advances in recent years, one of the most critical problems in this direction has been that previous methods work only with simple and synthetic scenes, not with complex and naturalistic images or videos. In this paper, we propose STEVE, an unsupervised model for object-centric learning in videos. Our proposed model makes a significant advancement by demonstrating its effectiveness on various complex and naturalistic videos, which is unprecedented in this line of research. Interestingly, this is achieved neither by adding complexity to the model architecture nor by introducing a new objective or weak supervision. Rather, it is achieved by a surprisingly simple architecture that uses a transformer-based image decoder conditioned on slots, with a learning objective that is simply to reconstruct the observation. Our experimental results on various complex and naturalistic videos show significant improvements over the previous state of the art.
