Paper Title
Boosting Monocular 3D Object Detection with Object-Centric Auxiliary Depth Supervision
Paper Authors
Paper Abstract
Recent advances in monocular 3D detection leverage a depth estimation network explicitly as an intermediate stage of the 3D detection pipeline. Depth map approaches yield more accurate depth to objects than other methods thanks to depth estimation networks trained on large-scale datasets. However, depth map approaches can be limited by the accuracy of the depth map, and sequentially running two separate networks for depth estimation and 3D detection significantly increases computation cost and inference time. In this work, we propose a method to boost an RGB image-based 3D detector by jointly training the detection network with a depth prediction loss analogous to the depth estimation task. In this way, our 3D detection network can be supervised with additional depth supervision from raw LiDAR points, which requires no human annotation cost, to estimate accurate depth without explicitly predicting a depth map. Our novel object-centric depth prediction loss focuses on the depth around foreground objects, which is what matters for 3D object detection, to leverage pixel-wise depth supervision in an object-centric manner. Our depth regression model is further trained to predict the uncertainty of depth, which serves as the 3D confidence of objects. To effectively train the 3D detector with raw LiDAR points and to enable end-to-end training, we revisit the regression targets of 3D objects and design the network architecture accordingly. Extensive experiments on the KITTI and nuScenes benchmarks show that our method significantly boosts monocular image-based 3D detectors, outperforming depth map approaches while maintaining real-time inference speed.
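The abstract does not spell out the exact form of the object-centric depth loss or the uncertainty head. Below is a minimal PyTorch sketch of one plausible instantiation, assuming a Laplacian likelihood over depth (a common choice in uncertainty-aware monocular 3D detectors) and a foreground mask derived from ground-truth 2D boxes; all names (pred_depth, pred_log_b, fg_mask, depth_confidence) are hypothetical illustrations, not the authors' API.

```python
import torch

def object_centric_depth_loss(pred_depth, pred_log_b, lidar_depth, fg_mask):
    """Hypothetical sketch: Laplacian NLL over foreground pixels with LiDAR hits.

    pred_depth : (B, H, W) depth predicted by the detection head.
    pred_log_b : (B, H, W) log-scale of a Laplacian over depth (uncertainty).
    lidar_depth: (B, H, W) sparse depth from projected raw LiDAR points,
                 zero where no point projects (no human annotation needed).
    fg_mask    : (B, H, W) bool mask of pixels around foreground objects,
                 e.g. inside ground-truth 2D boxes (assumption).
    """
    # Supervise only foreground pixels that actually received a LiDAR point.
    valid = (lidar_depth > 0) & fg_mask
    if valid.sum() == 0:
        return pred_depth.sum() * 0.0  # keep the graph alive; nothing to supervise

    err = (pred_depth - lidar_depth).abs()
    b = pred_log_b.exp()
    nll = err / b + pred_log_b  # Laplace negative log-likelihood up to a constant
    return nll[valid].mean()

def depth_confidence(pred_log_b):
    # Map the predicted depth uncertainty to a (0, 1] score that can act as
    # the 3D confidence of an object: low uncertainty -> high confidence.
    return torch.exp(-pred_log_b.exp())
```

Under this reading, masking the pixel-wise loss to foreground regions concentrates the auxiliary supervision where depth errors actually hurt 3D detection, and the learned uncertainty gives a per-object confidence for free at inference time, without ever decoding a full depth map.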