Paper Title

RayTran: 3D pose estimation and shape reconstruction of multiple objects from videos with ray-traced transformers

Paper Authors

Michał J. Tyszkiewicz, Kevis-Kokitsi Maninis, Stefan Popov, Vittorio Ferrari

Paper Abstract

We propose a transformer-based neural network architecture for multi-object 3D reconstruction from RGB videos. It relies on two alternative ways to represent its knowledge: as a global 3D grid of features and an array of view-specific 2D grids. We progressively exchange information between the two with a dedicated bidirectional attention mechanism. We exploit knowledge about the image formation process to significantly sparsify the attention weight matrix, making our architecture feasible on current hardware, both in terms of memory and computation. We attach a DETR-style head on top of the 3D feature grid in order to detect the objects in the scene and to predict their 3D pose and 3D shape. Compared to previous methods, our architecture is single stage, end-to-end trainable, and it can reason holistically about a scene from multiple video frames without needing a brittle tracking step. We evaluate our method on the challenging Scan2CAD dataset, where we outperform (1) recent state-of-the-art methods for 3D object pose estimation from RGB videos; and (2) a strong alternative method combining Multi-view Stereo with RGB-D CAD alignment. We plan to release our source code.
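The efficiency idea in the abstract, sparsifying the attention weight matrix using knowledge of the image formation process, can be made concrete: a pixel only needs to attend to the voxels that lie along its viewing ray, so the dense pixel-voxel attention matrix collapses to a geometry-derived sparse mask. The NumPy sketch below shows one minimal way to build such a mask by projecting voxel centers through a pinhole camera; the function name and all parameters (ray_voxel_attention_mask, K, R, t, grid_origin, voxel_size) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def ray_voxel_attention_mask(K, R, t, image_hw, grid_shape, voxel_size, grid_origin):
    """Boolean mask over (pixel, voxel) pairs that are allowed to attend.

    A pixel and a voxel interact only if the voxel's center projects into
    that pixel, i.e. the voxel lies on the pixel's viewing ray. This is a
    hypothetical stand-in for the ray-traced sparsification described in
    the abstract, not the paper's implementation.
    """
    H, W = image_hw

    # Voxel center coordinates in world space, shape (N, 3).
    idx = np.stack(np.meshgrid(*(np.arange(s) for s in grid_shape),
                               indexing="ij"), axis=-1).reshape(-1, 3)
    centers = grid_origin + (idx + 0.5) * voxel_size

    # World -> camera -> pixel (pinhole model): x_cam = R @ X + t, p = K @ x_cam.
    cam = centers @ R.T + t
    in_front = cam[:, 2] > 1e-6
    pix = cam @ K.T
    uv = pix[:, :2] / np.clip(pix[:, 2:3], 1e-6, None)
    u = np.floor(uv[:, 0]).astype(int)
    v = np.floor(uv[:, 1]).astype(int)
    hit = in_front & (0 <= u) & (u < W) & (0 <= v) & (v < H)

    # Sparse attention mask: True where a pixel may attend to a voxel.
    mask = np.zeros((H * W, centers.shape[0]), dtype=bool)
    mask[v[hit] * W + u[hit], np.nonzero(hit)[0]] = True
    return mask

# Example: a 48x64 view of an 8x8x8 grid; only a few hundred of the
# 3072 x 512 pixel-voxel pairs survive the geometric sparsification.
K = np.array([[60.0, 0.0, 32.0],
              [0.0, 60.0, 24.0],
              [0.0, 0.0, 1.0]])
mask = ray_voxel_attention_mask(
    K, np.eye(3), np.array([0.0, 0.0, 2.0]),
    image_hw=(48, 64), grid_shape=(8, 8, 8),
    voxel_size=0.25, grid_origin=np.array([-1.0, -1.0, -1.0]))
print(mask.shape, mask.sum())
```

Since the abstract describes the attention as bidirectional, the transpose of the same mask would presumably gate the 3D-to-2D direction, and stacking one such mask per video frame yields the multi-view sparsity pattern.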
