Paper Title

Multi-Sem Fusion: Multimodal Semantic Fusion for 3D Object Detection

Paper Authors

Shaoqing Xu, Fang Li, Ziying Song, Jin Fang, Sifen Wang, Zhi-Xin Yang

Abstract

LiDAR and camera fusion techniques are promising for achieving 3D object detection in autonomous driving. Most multi-modal 3D object detection frameworks integrate semantic knowledge from 2D images into 3D LiDAR point clouds to enhance detection accuracy. Nevertheless, the restricted resolution of 2D feature maps impedes accurate re-projection and often induces a pronounced boundary-blurring effect, which is primarily attributable to erroneous semantic segmentation. To address this limitation, we propose a general multi-modal fusion framework, Multi-Sem Fusion (MSF), that fuses the semantic information from both 2D image and 3D point cloud scene parsing results. Specifically, we employ 2D/3D semantic segmentation methods to generate parsing results for 2D images and 3D point clouds. The 2D semantic information is then re-projected into the 3D point clouds using calibration parameters. To handle the misalignment between the 2D and 3D parsing results, we propose an Adaptive Attention-based Fusion (AAF) module that fuses them by learning an adaptive fusion score. The point cloud with the fused semantic labels is then sent to the subsequent 3D object detector. Furthermore, we propose a Deep Feature Fusion (DFF) module to aggregate deep features at different levels to boost the final detection performance. The effectiveness of the framework has been verified on two public large-scale 3D object detection benchmarks by comparing it with different baselines. The experimental results show that the proposed fusion strategies can significantly improve detection performance compared to methods using only point clouds or only 2D semantic information. Most importantly, the proposed approach significantly outperforms other approaches and sets state-of-the-art results on the nuScenes testing benchmark.
