Paper Title

The Robustness Limits of SoTA Vision Models to Natural Variation

Authors

Mark Ibrahim, Quentin Garrido, Ari Morcos, Diane Bouchacourt

Abstract

Recent state-of-the-art vision models introduced new architectures, learning paradigms, and larger pretraining data, leading to impressive performance on tasks such as classification. While previous generations of vision models were shown to lack robustness to factors such as pose, it is unclear to what extent this next generation of models is more robust. To study this question, we develop a dataset of more than 7 million images with controlled changes in pose, position, background, lighting, and size. We study not only how robust recent state-of-the-art models are, but also the extent to which models can generalize to variation in these factors when it is present during training. We consider a catalog of recent vision models, including vision transformers (ViT), self-supervised models such as masked autoencoders (MAE), and models trained on larger datasets such as CLIP. We find that, out of the box, even today's best models are not robust to common changes in pose, size, and background. When some samples varied during training, we found models required a significant amount of diversity before they generalized, though robustness did eventually improve. When diversity was seen only for some classes, however, models did not generalize to other classes unless those classes were very similar to the ones seen varying during training. We hope our work will shed further light on the blind spots of SoTA models and spur the development of more robust vision models.
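As one illustration of the kind of evaluation the abstract describes, the sketch below measures how a pretrained classifier's accuracy shifts as a single factor (e.g., pose) varies while everything else is held fixed. This is a minimal sketch, not the authors' code: the data directory layout (one ImageFolder per factor setting, e.g. data/pose_0deg) is hypothetical, and torchvision's ViT-B/16 stands in for the paper's full catalog of models (ViT, MAE, CLIP).

```python
# Minimal sketch: per-factor-setting accuracy for a pretrained backbone.
# Assumes each factor setting is a separate ImageFolder whose class indices
# align with the model's label space (or that the head has been fine-tuned).
import torch
from torchvision import models
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader

device = "cuda" if torch.cuda.is_available() else "cpu"

# Any off-the-shelf SoTA backbone could go here; ViT-B/16 as one example.
weights = models.ViT_B_16_Weights.IMAGENET1K_V1
model = models.vit_b_16(weights=weights).to(device).eval()
preprocess = weights.transforms()

@torch.no_grad()
def accuracy(loader):
    correct = total = 0
    for images, labels in loader:
        logits = model(images.to(device))
        correct += (logits.argmax(dim=1).cpu() == labels).sum().item()
        total += labels.numel()
    return correct / total

# Hypothetical layout: data/pose_0deg/, data/pose_90deg/, ... each an
# ImageFolder of classes, identical except for the varied factor.
for split in ["pose_0deg", "pose_90deg", "pose_180deg"]:
    loader = DataLoader(ImageFolder(f"data/{split}", transform=preprocess),
                        batch_size=64, num_workers=4)
    print(split, f"accuracy = {accuracy(loader):.3f}")
```

Comparing the printed accuracies across settings gives a simple per-factor robustness curve; repeating the loop over several backbones reproduces the style of cross-model comparison the abstract reports.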
