Paper Title
A ConvNet for the 2020s
Paper Authors
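Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christopher Feichtenhofer, Trevor Darrell, Saining Xie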
Paper Abstract
The "Roaring 20s" of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks. However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers, rather than the inherent inductive biases of convolutions. In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve. We gradually "modernize" a standard ResNet toward the design of a vision Transformer, and discover several key components that contribute to the performance difference along the way. The outcome of this exploration is a family of pure ConvNet models dubbed ConvNeXt. Constructed entirely from standard ConvNet modules, ConvNeXts compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets.