Paper Title
SEPT: Towards Scalable and Efficient Visual Pre-Training
Paper Authors
Paper Abstract
Recently, the self-supervised pre-training paradigm has shown great potential in leveraging large-scale unlabeled data to improve downstream task performance. However, increasing the scale of unlabeled pre-training data in real-world scenarios requires prohibitive computational costs and faces the challenge of uncurated samples. To address these issues, we build a task-specific self-supervised pre-training framework from a data-selection perspective, based on a simple hypothesis: pre-training on unlabeled samples whose distribution is similar to that of the target task brings substantial performance gains. Supported by this hypothesis, we propose the first such framework, Scalable and Efficient visual Pre-Training (SEPT), which introduces a retrieval pipeline for data selection. SEPT first leverages a self-supervised pre-trained model to extract features for the entire unlabeled dataset and initialize the retrieval pipeline. Then, for a specific target task, SEPT retrieves from the unlabeled dataset the samples most similar to each target instance, based on feature similarity, to form the pre-training set. Finally, SEPT pre-trains the target model on the selected unlabeled samples in a self-supervised manner and then fine-tunes it on the target data. By decoupling the scale of pre-training from the scale of available upstream data for a target task, SEPT achieves high scalability with respect to the upstream dataset and high pre-training efficiency, which also allows flexible choices of model architecture. Results on various downstream tasks demonstrate that SEPT achieves competitive or even better performance than ImageNet pre-training while reducing the number of training samples by one order of magnitude, without resorting to any extra annotations.
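The retrieval-based data selection step described in the abstract can be illustrated with a minimal sketch. The code below is not the authors' implementation: it assumes cosine-similarity nearest-neighbor retrieval over pre-extracted features, and the function name, array shapes, and the per-instance retrieval budget k_per_instance are hypothetical. At realistic scales, an approximate nearest-neighbor index (e.g., Faiss) would replace the dense similarity matrix.

```python
import numpy as np

def select_pretraining_subset(unlabeled_feats, target_feats, k_per_instance):
    """Illustrative retrieval step (hypothetical, not the authors' code):
    for each target instance, pick the k most similar unlabeled samples by
    cosine similarity, then take the union as the task-specific pre-training set."""
    # L2-normalize so that a dot product equals cosine similarity.
    u = unlabeled_feats / np.linalg.norm(unlabeled_feats, axis=1, keepdims=True)
    t = target_feats / np.linalg.norm(target_feats, axis=1, keepdims=True)
    sims = t @ u.T                                    # shape: (n_target, n_unlabeled)
    topk = np.argsort(-sims, axis=1)[:, :k_per_instance]
    return np.unique(topk)                            # indices into the unlabeled pool

# Toy example: 100k unlabeled features, 1k target features, 128-d embeddings.
unlabeled_feats = np.random.randn(100_000, 128).astype(np.float32)
target_feats = np.random.randn(1_000, 128).astype(np.float32)
subset_indices = select_pretraining_subset(unlabeled_feats, target_feats, k_per_instance=50)
print(f"Selected {subset_indices.size} unlabeled samples for pre-training.")
```

The selected subset would then be used for self-supervised pre-training of the target model, followed by fine-tuning on the labeled target data, as the abstract describes.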