Paper Title
Unsupervised Object Representation Learning using Translation and Rotation Group Equivariant VAE
Paper Authors
Abstract
In many imaging modalities, objects of interest can occur in a variety of locations and poses (i.e. are subject to translations and rotations in 2D or 3D), but the location and pose of an object does not change its semantics (i.e. the object's essence). That is, the specific location and rotation of an airplane in satellite imagery, or the 3D rotation of a chair in a natural image, or the rotation of a particle in a cryo-electron micrograph, do not change the intrinsic nature of those objects. Here, we consider the problem of learning semantic representations of objects that are invariant to pose and location in a fully unsupervised manner. We address shortcomings in previous approaches to this problem by introducing TARGET-VAE, a translation and rotation group-equivariant variational autoencoder framework. TARGET-VAE combines three core innovations: 1) a rotation and translation group-equivariant encoder architecture, 2) a structurally disentangled distribution over latent rotation, translation, and a rotation-translation-invariant semantic object representation, which are jointly inferred by the approximate inference network, and 3) a spatially equivariant generator network. In comprehensive experiments, we show that TARGET-VAE learns disentangled representations without supervision that significantly improve upon, and avoid the pathologies of, previous methods. When trained on images highly corrupted by rotation and translation, the semantic representations learned by TARGET-VAE are similar to those learned on consistently posed objects, dramatically improving clustering in the semantic latent space. Furthermore, TARGET-VAE is able to perform remarkably accurate unsupervised pose and location inference. We expect methods like TARGET-VAE will underpin future approaches for unsupervised object generation, pose prediction, and object detection.
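The central idea behind the group-equivariant encoder can be illustrated in miniature. The sketch below is not the TARGET-VAE architecture itself; it is a minimal NumPy demonstration, over the discrete rotation group C4 (90° rotations), of the two properties the abstract relies on: rotating the input *permutes* the group dimension of the encoder's output (equivariance), so pooling over the group yields a rotation-*invariant* descriptor. The function name `c4_features` and the single-filter setup are illustrative choices, not part of the paper.

```python
import numpy as np

def c4_features(img, filt):
    """Correlate img with the filter rotated by each element of C4.

    Returns a length-4 vector of responses, one per 90-degree rotation.
    This is a one-filter, one-position caricature of a group-equivariant
    encoder layer.
    """
    return np.array([np.sum(img * np.rot90(filt, k)) for k in range(4)])

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 8))
filt = rng.normal(size=(8, 8))

f = c4_features(img, filt)               # features of the original image
f_rot = c4_features(np.rot90(img), filt)  # features of the rotated image

# Equivariance: rotating the input by 90 degrees cyclically shifts the
# response along the group dimension rather than scrambling it.
assert np.allclose(np.roll(f, 1), f_rot)

# Invariance: pooling over the group dimension discards the pose, so the
# pooled "semantic" descriptor is identical for both inputs.
assert np.isclose(f.max(), f_rot.max())
```

In TARGET-VAE this construction is lifted to continuous rotations and translations, and instead of max-pooling the pose away, the inference network outputs a joint distribution over rotation, translation, and the invariant semantic code, which is what enables the unsupervised pose and location inference described above.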