Can We Solve 3D Vision Tasks Starting from a 2D Vision Transformer?

Paper Title

Can We Solve 3D Vision Tasks Starting from A 2D Vision Transformer?

Authors

Wang, Yi; Fan, Zhiwen; Chen, Tianlong; Fan, Hehe; Wang, Zhangyang

Abstract

Vision Transformers (ViTs) have proven effective both in solving 2D image understanding tasks, by training over large-scale image datasets, and, as a somewhat separate track, in modeling the 3D visual world, such as voxels or point clouds. However, despite the growing hope that transformers can become the "universal" modeling tool for heterogeneous data, ViTs for 2D and 3D tasks have so far adopted vastly different architecture designs that are hardly transferable. That invites an (over-)ambitious question: can we close the gap between the 2D and 3D ViT architectures? As a pilot study, this paper demonstrates the appealing promise of understanding the 3D visual world using a standard 2D ViT architecture, with only minimal customization at the input and output levels and without redesigning the pipeline. To build a 3D ViT from its 2D sibling, we "inflate" the patch embedding and token sequence, accompanied by new positional encoding mechanisms designed to match the 3D data geometry. The resultant "minimalist" 3D ViT, named Simple3D-Former, performs surprisingly robustly on popular 3D tasks such as object classification, point cloud segmentation and indoor scene detection, compared to highly customized 3D-specific designs. It can hence act as a strong baseline for new 3D ViTs. Moreover, we note that pursuing a unified 2D-3D ViT design has practical relevance beyond scientific curiosity. Specifically, we demonstrate that Simple3D-Former can naturally exploit the wealth of pre-trained weights from large-scale realistic 2D images (e.g., ImageNet), which can be plugged in to enhance 3D task performance "for free".
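
The abstract does not spell out how the patch embedding and token sequence are "inflated", so the following is only a minimal sketch of one plausible reading, not the Simple3D-Former implementation. It assumes non-overlapping cubic patches over a voxel grid and a kernel-replication rule (copy the 2D kernel along a new depth axis and scale by 1/depth). All function names here are hypothetical, and plain Python lists stand in for tensors.

```python
# Illustrative sketch only: the function names and the inflation rule are
# assumptions, not taken from the paper.

def patch_origins_2d(height, width, patch):
    """Top-left corner of every non-overlapping 2D patch, in row-major order."""
    return [(r, c)
            for r in range(0, height, patch)
            for c in range(0, width, patch)]


def patch_origins_3d(depth, height, width, patch):
    """Corner of every non-overlapping cubic patch of a voxel grid."""
    return [(z, r, c)
            for z in range(0, depth, patch)
            for r in range(0, height, patch)
            for c in range(0, width, patch)]


def inflate_kernel(kernel_2d, depth):
    """Copy a 2D kernel `depth` times along a new axis, scaled by 1/depth."""
    return [[[w / depth for w in row] for row in kernel_2d]
            for _ in range(depth)]


def embed_2d(kernel_2d, patch_2d):
    """Linear patch embedding with one output channel (a dot product)."""
    return sum(w * x
               for k_row, p_row in zip(kernel_2d, patch_2d)
               for w, x in zip(k_row, p_row))


def embed_3d(kernel_3d, patch_3d):
    """The same dot product, summed over the extra depth axis."""
    return sum(embed_2d(k_slice, p_slice)
               for k_slice, p_slice in zip(kernel_3d, patch_3d))


if __name__ == "__main__":
    # The token sequence grows from a 2D grid of patches to a 3D grid.
    print(len(patch_origins_2d(8, 8, 4)))      # 4 tokens
    print(len(patch_origins_3d(8, 8, 8, 4)))   # 8 tokens

    kernel = [[1.0, 2.0], [3.0, 4.0]]
    image_patch = [[1.0, 0.0], [0.0, 1.0]]
    voxel_patch = [image_patch, image_patch]   # constant along depth
    print(embed_2d(kernel, image_patch))                      # 5.0
    print(embed_3d(inflate_kernel(kernel, 2), voxel_patch))   # 5.0
```

Under this assumed rule, a voxel patch that is constant along depth receives the same embedding as its 2D slice would under the original kernel, which is one way pre-trained 2D weights could be reused on 3D input.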
