Paper Title

Early or Late Fusion Matters: Efficient RGB-D Fusion in Vision Transformers for 3D Object Recognition

Paper Authors

Georgios Tziafas, Hamidreza Kasaei

Paper Abstract

The Vision Transformer (ViT) architecture has established its place in the computer vision literature; however, training ViTs for RGB-D object recognition remains an understudied topic, viewed in recent literature only through the lens of multi-task pretraining in multiple vision modalities. Such approaches are often computationally intensive, relying on the scale of multiple pretraining datasets to align RGB with 3D information. In this work, we propose a simple yet strong recipe for transferring pretrained ViTs to RGB-D domains for 3D object recognition, focusing on fusing RGB and depth representations encoded jointly by the ViT. Compared to previous work on multimodal Transformers, the key challenge here is to use the attested flexibility of ViTs to capture cross-modal interactions at the downstream rather than the pretraining stage. We explore which depth representation is better in terms of resulting accuracy and compare early and late fusion techniques for aligning the RGB and depth modalities within the ViT architecture. Experimental results on the Washington RGB-D Object dataset (ROD) demonstrate that in such RGB -> RGB-D scenarios, late fusion techniques work better than the more commonly employed early fusion. With our transfer baseline, fusion ViTs score up to 95.4% top-1 accuracy on ROD, achieving new state-of-the-art results on this benchmark. We further show the benefits of using our multimodal fusion baseline over unimodal feature extractors in a synthetic-to-real visual adaptation setting, as well as in an open-ended lifelong learning scenario on the ROD benchmark, where our model outperforms previous work by a margin of >8%. Finally, we integrate our method with a robot framework and demonstrate how it can serve as a perception utility in an interactive robot learning scenario, both in simulation and with a real robot.
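
The early-versus-late fusion contrast in the abstract can be illustrated with a minimal, hypothetical PyTorch sketch. This is not the authors' released code: it assumes a timm-style pretrained ViT backbone exposing `patch_embed`, `blocks`, `embed_dim`, and `forward_features`, and a depth map already rendered as a 3-channel image (e.g., colorized depth); positional embeddings, the class token, and final normalization are omitted for brevity.

```python
# Hypothetical sketch (not the authors' implementation) of early vs. late
# RGB-D fusion around a shared pretrained ViT backbone.
import torch
import torch.nn as nn

class EarlyFusionViT(nn.Module):
    """Early fusion: concatenate RGB and depth patch tokens, encode them jointly."""
    def __init__(self, vit: nn.Module, num_classes: int):
        super().__init__()
        self.vit = vit
        self.head = nn.Linear(vit.embed_dim, num_classes)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # Shared patch embedding for both modalities; tokens are concatenated along
        # the sequence dimension so self-attention can mix modalities in every block.
        tokens = torch.cat([self.vit.patch_embed(rgb),
                            self.vit.patch_embed(depth)], dim=1)
        feats = self.vit.blocks(tokens)          # joint cross-modal encoding
        return self.head(feats.mean(dim=1))      # mean-pool tokens, then classify

class LateFusionViT(nn.Module):
    """Late fusion: encode each modality separately, fuse pooled features at the head."""
    def __init__(self, vit: nn.Module, num_classes: int):
        super().__init__()
        self.vit = vit                           # shared backbone for simplicity;
        self.head = nn.Linear(2 * vit.embed_dim, num_classes)  # separate encoders also possible

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        f_rgb = self.vit.forward_features(rgb).mean(dim=1)      # unimodal RGB features
        f_depth = self.vit.forward_features(depth).mean(dim=1)  # unimodal depth features
        return self.head(torch.cat([f_rgb, f_depth], dim=-1))   # fuse only at the classifier
```

Under this framing, the abstract's main finding is that in the RGB -> RGB-D transfer setting, the late-fusion variant tends to outperform the more commonly used early fusion.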
