Paper Title
MSR: Making Self-supervised learning Robust to Aggressive Augmentations
Paper Authors
Paper Abstract
Most recent self-supervised learning methods learn visual representations by contrasting different augmented views of images. Compared with supervised learning, more aggressive augmentations have been introduced to further improve the diversity of training pairs. However, aggressive augmentations may distort images' structures, leading to a severe semantic shift problem: augmented views of the same image may not share the same semantics, thus degrading transfer performance. To address this problem, we propose a new SSL paradigm, which counteracts the impact of semantic shift by balancing the roles of weakly and aggressively augmented pairs. Specifically, semantically inconsistent pairs are in the minority, and we treat them as noisy pairs. Note that deep neural networks (DNNs) exhibit a crucial memorization effect: DNNs tend to first memorize clean (majority) examples before overfitting to noisy (minority) examples. Therefore, we set a relatively large weight for aggressively augmented data pairs at the early learning stage. As training proceeds, the model begins to overfit the noisy pairs, so we gradually reduce the weight of the aggressively augmented pairs. In doing so, our method can better embrace aggressive augmentations and neutralize the semantic shift problem. Experiments show that our model achieves 73.1% top-1 accuracy on ImageNet-1K with ResNet-50 for 200 epochs, a 2.5% improvement over BYOL. Moreover, experiments also demonstrate that the learned representations transfer well to various downstream tasks.
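To make the weighting idea in the abstract concrete, here is a minimal sketch of a BYOL-style objective that combines losses from weakly and aggressively augmented pairs, with the aggressive term down-weighted as training proceeds. The function names (`byol_pair_loss`, `aggressive_weight`, `msr_style_loss`) and the cosine decay schedule are illustrative assumptions, not the paper's exact formulation.

```python
import math
import torch
import torch.nn.functional as F

def byol_pair_loss(online_pred, target_proj):
    # Negative cosine similarity (equivalently, MSE of L2-normalized vectors),
    # as used in BYOL-style objectives.
    online_pred = F.normalize(online_pred, dim=-1)
    target_proj = F.normalize(target_proj, dim=-1)
    return 2.0 - 2.0 * (online_pred * target_proj).sum(dim=-1).mean()

def aggressive_weight(epoch, total_epochs, w_max=1.0, w_min=0.0):
    # Hypothetical schedule: a relatively large weight for aggressively augmented
    # pairs early on, decayed over training (cosine decay chosen for illustration).
    progress = epoch / max(total_epochs - 1, 1)
    return w_min + 0.5 * (w_max - w_min) * (1.0 + math.cos(math.pi * progress))

def msr_style_loss(pred_weak, targ_weak, pred_aggr, targ_aggr, epoch, total_epochs):
    # Combine the weakly augmented pair loss with the aggressively augmented pair
    # loss, down-weighting the aggressive (potentially semantically shifted,
    # i.e. noisy) pairs as the model starts to memorize them.
    w = aggressive_weight(epoch, total_epochs)
    loss_weak = byol_pair_loss(pred_weak, targ_weak)
    loss_aggr = byol_pair_loss(pred_aggr, targ_aggr)
    return loss_weak + w * loss_aggr
```

In this sketch, `pred_*` are the online network's predictions and `targ_*` the momentum (target) network's projections for the corresponding augmented views; only the relative weighting of the two pair losses changes over epochs.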