Paper Title
Learning Ego 3D Representation as Ray Tracing
Paper Authors
Paper Abstract
A self-driving perception model aims to extract 3D semantic representations from multiple cameras and aggregate them into the bird's-eye-view (BEV) coordinate frame of the ego car in order to ground the downstream planner. Existing perception methods often rely on error-prone depth estimation of the whole scene, or learn sparse virtual 3D representations without the target geometric structure; both remain limited in performance and/or capability. In this paper, we present a novel end-to-end architecture for learning ego 3D representations from an arbitrary number of unconstrained camera views. Inspired by the ray tracing principle, we design a polarized grid of "imaginary eyes" as the learnable ego 3D representation, and formulate the learning process with an adaptive attention mechanism in conjunction with 3D-to-2D projection. Critically, this formulation allows extracting rich 3D representations from 2D images without any depth supervision, with a built-in geometric structure consistent with BEV. Despite its simplicity and versatility, extensive experiments on standard BEV visual tasks (e.g., camera-based 3D object detection and BEV segmentation) show that our model significantly outperforms all state-of-the-art alternatives, with the extra advantage of computational efficiency from multi-task learning.
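To make the abstract's core idea concrete, below is a minimal, hypothetical PyTorch sketch (not the authors' released implementation) of the three named ingredients: a polarized BEV grid of "imaginary eye" queries, 3D-to-2D projection of each grid point into the camera views, and attention-style fusion of the sampled image features. All function names (polar_grid, sample_view, fuse_views, yaw_extrinsics), shapes, and camera parameters are illustrative assumptions, and plain scaled dot-product weighting stands in for the paper's adaptive attention.

```python
import math

import torch
import torch.nn.functional as F


def polar_grid(num_angles=80, num_radii=64, max_radius=51.2, z=0.0):
    """3D anchor points of a polarized BEV grid, shape (A, R, 3)."""
    ang = torch.linspace(0.0, 2.0 * math.pi, num_angles + 1)[:-1]
    rad = torch.linspace(1.0, max_radius, num_radii)
    a, r = torch.meshgrid(ang, rad, indexing="ij")
    return torch.stack([r * torch.cos(a), r * torch.sin(a),
                        torch.full_like(r, z)], dim=-1)


def sample_view(points, feat, K, T):
    """Project world points (N, 3) into one camera (intrinsics K: 3x3,
    world-to-camera T: 4x4) and bilinearly sample its feature map
    feat (C, H, W). Returns (N, C) features and an (N,) visibility mask."""
    ones = torch.ones(points.shape[0], 1)
    cam = (T @ torch.cat([points, ones], dim=-1).T).T[:, :3]
    in_front = cam[:, 2] > 1e-3                    # in front of the camera
    pix = (K @ cam.T).T
    pix = pix[:, :2] / pix[:, 2:].clamp(min=1e-3)  # perspective divide
    C, H, W = feat.shape
    gx = pix[:, 0] / (W - 1) * 2 - 1               # to [-1, 1] for grid_sample
    gy = pix[:, 1] / (H - 1) * 2 - 1
    grid = torch.stack([gx, gy], dim=-1).view(1, -1, 1, 2)
    sampled = F.grid_sample(feat[None], grid, align_corners=True)[0, :, :, 0].T
    visible = in_front & (gx.abs() <= 1) & (gy.abs() <= 1)
    return sampled, visible


def fuse_views(points, queries, feats, Ks, Ts):
    """Attention-style fusion: each "eye" query scores the feature sampled
    from every camera that sees it, then takes a softmax-weighted sum."""
    scores, values = [], []
    for feat, K, T in zip(feats, Ks, Ts):
        v, vis = sample_view(points, feat, K, T)
        s = (queries * v).sum(-1) / math.sqrt(queries.shape[-1])
        scores.append(s.masked_fill(~vis, float("-inf")))
        values.append(v)
    w = torch.softmax(torch.stack(scores), dim=0)  # (V, N) per-view weights
    w = torch.nan_to_num(w)                        # points seen by no camera
    return (w.unsqueeze(-1) * torch.stack(values)).sum(0)


def yaw_extrinsics(yaw):
    """World-to-camera 4x4 for a camera at the origin whose optical axis
    points along the given yaw on the ground plane (world z is up)."""
    c, s = math.cos(yaw), math.sin(yaw)
    T = torch.eye(4)
    T[:3, :3] = torch.tensor([[s, -c, 0.0],      # camera x: image right
                              [0.0, 0.0, -1.0],  # camera y: image down
                              [c, s, 0.0]])      # camera z: optical axis
    return T


# Toy shape check with 6 synthetic surround cameras, 60 degrees apart.
pts = polar_grid().reshape(-1, 3)                     # (5120, 3) grid points
queries = torch.randn(pts.shape[0], 128)              # learnable "eye" queries
feats = [torch.randn(128, 32, 88) for _ in range(6)]  # per-view feature maps
K = torch.tensor([[60.0, 0.0, 44.0], [0.0, 60.0, 16.0], [0.0, 0.0, 1.0]])
Ts = [yaw_extrinsics(i * math.pi / 3) for i in range(6)]
bev = fuse_views(pts, queries, feats, [K] * 6, Ts)    # (5120, 128) ego features
```

Note how this sketch inverts the usual pipeline: instead of estimating depth and lifting pixels into 3D, the fixed polar grid casts rays outward from the ego car and pulls 2D features back onto it, which is why no depth supervision is needed and the geometric structure is consistent with BEV by construction.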