Paper Title
STS: Surround-view Temporal Stereo for Multi-view 3D Detection
Paper Authors
Paper Abstract
Learning accurate depth is essential to multi-view 3D object detection. Recent approaches mainly learn depth from monocular images, which confront inherent difficulties due to the ill-posed nature of monocular depth learning. Instead of relying solely on monocular depth, in this work we propose a novel Surround-view Temporal Stereo (STS) technique that leverages the geometric correspondence between frames across time to facilitate accurate depth learning. Specifically, we regard the fields of view from all cameras around the ego vehicle as a unified view, namely the surround view, and conduct temporal stereo matching on it. The resulting geometric correspondence between different frames from STS is combined with the monocular depth to yield the final depth prediction. Comprehensive experiments on nuScenes show that STS greatly boosts 3D detection ability, notably for medium- and long-distance objects. On BEVDepth with a ResNet-50 backbone, STS improves mAP and NDS by 2.6% and 1.4%, respectively. Consistent improvements are observed when using a larger backbone and a larger image resolution, demonstrating its effectiveness.
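The core of temporal stereo matching as the abstract describes it is to warp features from a past frame onto depth-hypothesis planes of the current frame and score how well they match at each candidate depth. The paper itself does not publish this pseudocode; below is a minimal, hypothetical plane-sweep sketch in NumPy with nearest-neighbour sampling, where `K` (camera intrinsics), `T_prev_cur` (pose taking current-frame points into the previous frame), and `depth_bins` are assumed inputs, not names from the paper.

```python
import numpy as np

def plane_sweep_cost_volume(feat_cur, feat_prev, K, T_prev_cur, depth_bins):
    """Build a temporal-stereo cost volume: for each depth hypothesis, warp the
    previous frame's features into the current view and correlate them with the
    current features. Returns a (D, H, W) matching score per depth bin."""
    H, W, _ = feat_cur.shape
    K_inv = np.linalg.inv(K)
    R, t = T_prev_cur[:3, :3], T_prev_cur[:3, 3]
    # Homogeneous pixel grid of the current frame, shape (3, H*W).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T
    cost = np.zeros((len(depth_bins), H, W))
    for d_idx, d in enumerate(depth_bins):
        # Back-project current pixels to 3D at depth d, move them into the
        # previous camera frame, and re-project into the previous image.
        pts = (K_inv @ pix) * d
        pts_prev = R @ pts + t[:, None]
        proj = K @ pts_prev
        uv = proj[:2] / np.clip(proj[2:], 1e-6, None)
        u_w = np.clip(np.round(uv[0]).astype(int), 0, W - 1).reshape(H, W)
        v_w = np.clip(np.round(uv[1]).astype(int), 0, H - 1).reshape(H, W)
        warped = feat_prev[v_w, u_w]  # nearest-neighbour feature warp
        # Dot-product correlation: high where features agree at this depth.
        cost[d_idx] = (feat_cur * warped).sum(-1)
    return cost
```

A softmax over the depth axis of this volume gives a per-pixel depth distribution that could then be fused with a monocular depth prediction, which is the combination the abstract refers to; the exact fusion used in STS is not specified here.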