Paper Title
Learning Representations by Predicting Bags of Visual Words
Paper Authors
Paper Abstract
Self-supervised representation learning aims to learn convnet-based image representations from unlabeled data. Inspired by the success of NLP methods in this area, in this work we propose a self-supervised approach based on spatially dense image descriptions that encode discrete visual concepts, here called visual words. To build such discrete representations, we quantize the feature maps of a first pre-trained self-supervised convnet over a k-means-based vocabulary. Then, as a self-supervised task, we train another convnet to predict the histogram of visual words of an image (i.e., its Bag-of-Words representation) given as input a perturbed version of that image. The proposed task forces the convnet to learn perturbation-invariant and context-aware image features that are useful for downstream image understanding tasks. We extensively evaluate our method and demonstrate very strong empirical results; e.g., compared to the supervised case, our pre-trained self-supervised representations transfer better on the detection task, and similarly on classification over classes "unseen" during pre-training. This also shows that the process of discretizing images into visual words can provide the basis for very powerful self-supervised approaches in the image domain, thus allowing further connections to be made to related methods from the NLP domain that have been extremely successful so far.
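To make the two-stage pipeline concrete, below is a minimal PyTorch sketch of the pretext task: dense features from a frozen, pre-trained convnet are quantized against a visual-word vocabulary, and a second convnet is trained to predict the resulting BoW histogram from a perturbed view of the image. All names (feature_net, bow_net, num_words, the tiny architectures, and the random vocabulary) are illustrative assumptions, not the authors' code; in the actual method the vocabulary is fit with k-means on the dense features, and stronger augmentations are used as perturbations.

```python
# Minimal sketch of the BoW-prediction pretext task (assumptions noted inline).
import torch
import torch.nn as nn
import torch.nn.functional as F

num_words = 2048   # size of the visual-word vocabulary (illustrative value)
feat_dim = 256     # channel dim of the frozen convnet's feature maps (illustrative)

# Stand-in for the first pre-trained self-supervised convnet; kept frozen.
feature_net = nn.Sequential(
    nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1),
).eval()
for p in feature_net.parameters():
    p.requires_grad_(False)

# Visual-word centroids; in the real method these come from k-means on features.
vocabulary = torch.randn(num_words, feat_dim)

def bow_target(images):
    """Quantize dense features to their nearest visual word and
    return the normalized histogram (soft BoW label) per image."""
    with torch.no_grad():
        fmap = feature_net(images)                   # (B, C, H, W)
        feats = fmap.flatten(2).transpose(1, 2)      # (B, H*W, C)
        dists = torch.cdist(feats, vocabulary.expand(feats.size(0), -1, -1))
        words = dists.argmin(dim=2)                  # (B, H*W) word indices
        return F.one_hot(words, num_words).float().mean(dim=1)  # (B, K)

# Second convnet, trained to predict the BoW of the *original* image
# from a perturbed view of it.
bow_net = nn.Sequential(
    nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, num_words),
)

def loss_fn(perturbed, originals):
    # KL divergence against the soft histogram; one reasonable choice,
    # equivalent up to a constant to cross-entropy with soft targets.
    target = bow_target(originals)
    logits = bow_net(perturbed)
    return F.kl_div(F.log_softmax(logits, dim=1), target, reduction="batchmean")

images = torch.rand(4, 3, 64, 64)
perturbed = torch.flip(images, dims=[3])  # stand-in for real augmentations
print(loss_fn(perturbed, images).item())
```

Because the target histogram is computed from the unperturbed image while the prediction sees only the perturbed view, minimizing this loss pushes bow_net toward perturbation-invariant, context-aware features, as the abstract describes.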