Paper Title

Efficient and Robust 2D-to-BEV Representation Learning via Geometry-guided Kernel Transformer

Paper Authors

Shaoyu Chen, Tianheng Cheng, Xinggang Wang, Wenming Meng, Qian Zhang, Wenyu Liu

Paper Abstract

Learning Bird's Eye View (BEV) representation from surrounding-view cameras is of great importance for autonomous driving. In this work, we propose a Geometry-guided Kernel Transformer (GKT), a novel 2D-to-BEV representation learning mechanism. GKT leverages the geometric priors to guide the transformer to focus on discriminative regions and unfolds kernel features to generate BEV representation. For fast inference, we further introduce a look-up table (LUT) indexing method to get rid of the camera's calibrated parameters at runtime. GKT can run at $72.3$ FPS on 3090 GPU / $45.6$ FPS on 2080ti GPU and is robust to the camera deviation and the predefined BEV height. And GKT achieves the state-of-the-art real-time segmentation results, i.e., 38.0 mIoU (100m$\times$100m perception range at a 0.5m resolution) on the nuScenes val set. Given the efficiency, effectiveness, and robustness, GKT has great practical values in autopilot scenarios, especially for real-time running systems. Code and models will be available at \url{https://github.com/hustvl/GKT}.
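
The look-up table (LUT) indexing idea mentioned in the abstract can be made concrete with a small sketch. The following is a hypothetical illustration, not the authors' implementation: it assumes a single predefined BEV height plane and shows only the LUT part, i.e., precomputing from camera intrinsics/extrinsics which image pixel each BEV grid cell projects to, so that runtime BEV feature generation reduces to pure index gathering with no calibration math. The function names (`build_lut`, `gather_bev`) and all parameters are placeholders; the actual GKT additionally unfolds a kernel of features around each projected location and fuses them with a transformer.

```python
# Hypothetical sketch (not the paper's code): precompute a look-up table mapping
# each BEV cell to an image pixel index, so calibration is not needed at runtime.
import torch

def build_lut(bev_size, bev_range, bev_height, intrinsics, extrinsics, img_size):
    """Offline step: project every BEV cell center onto the image plane.

    bev_size:   (H_bev, W_bev) number of grid cells
    bev_range:  (x_min, x_max, y_min, y_max) in meters, ego frame
    bev_height: assumed height (z) of the BEV plane in meters
    intrinsics: (3, 3) camera intrinsic matrix
    extrinsics: (4, 4) ego-to-camera transform
    img_size:   (H_img, W_img) image feature-map size in pixels
    """
    H_bev, W_bev = bev_size
    x_min, x_max, y_min, y_max = bev_range
    xs = torch.linspace(x_min, x_max, W_bev)
    ys = torch.linspace(y_min, y_max, H_bev)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    zs = torch.full_like(grid_x, bev_height)
    ones = torch.ones_like(grid_x)
    # Homogeneous 3D coordinates of all BEV cell centers, shape (4, H_bev*W_bev).
    pts = torch.stack([grid_x, grid_y, zs, ones], dim=0).reshape(4, -1)

    cam_pts = extrinsics @ pts                    # ego frame -> camera frame
    depth = cam_pts[2].clamp(min=1e-5)            # avoid division by zero
    uv = intrinsics @ cam_pts[:3]                 # project to pixel coordinates
    u = (uv[0] / depth).round().long()
    v = (uv[1] / depth).round().long()

    H_img, W_img = img_size
    valid = (cam_pts[2] > 0) & (u >= 0) & (u < W_img) & (v >= 0) & (v < H_img)
    # Flattened pixel index per BEV cell; invalid cells fall back to index 0.
    lut = torch.where(valid, v * W_img + u, torch.zeros_like(u))
    return lut.reshape(H_bev, W_bev), valid.reshape(H_bev, W_bev)

def gather_bev(feat, lut, valid):
    """Runtime step: build a BEV map by indexing image features with the LUT.

    feat: (C, H_img, W_img) image feature map
    """
    C = feat.shape[0]
    flat = feat.reshape(C, -1)                    # (C, H_img*W_img)
    bev = flat[:, lut.reshape(-1)]                # pure indexing, no calibration
    bev = bev * valid.reshape(1, -1)              # zero out cells outside the view
    return bev.reshape(C, *lut.shape)
```

Precomputing the projection in this way is what allows the runtime path to be free of camera calibration parameters, which is the efficiency property the abstract highlights; the kernel unfolding and transformer attention around each indexed location are what the sketch deliberately omits.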
