Paper title
Beyond Cross-view Image Retrieval: Highly Accurate Vehicle Localization Using Satellite Image
Paper authors
Paper abstract
This paper addresses the problem of vehicle-mounted camera localization by matching a ground-level image with an overhead-view satellite map. Existing methods often treat this problem as cross-view image retrieval, and use learned deep features to match the ground-level query image to a partition (e.g., a small patch) of the satellite map. With these methods, the localization accuracy is limited by the partitioning density of the satellite map (often on the order of tens of meters). Departing from the conventional wisdom of image retrieval, this paper presents a novel solution that can achieve highly accurate localization. The key idea is to formulate the task as pose estimation and solve it by neural-net-based optimization. Specifically, we design a two-branch CNN to extract robust features from the ground and satellite images, respectively. To bridge the vast cross-view domain gap, we resort to a Geometry Projection module that projects features from the satellite map to the ground view, based on a relative camera pose. To minimize the differences between the projected features and the observed features, we employ a differentiable Levenberg-Marquardt (LM) module to search for the optimal camera pose iteratively. The entire pipeline is differentiable and runs end-to-end. Extensive experiments on standard autonomous vehicle localization datasets have confirmed the superiority of the proposed method. Notably, starting from a coarse estimate of camera location within a wide region of 40 m x 40 m, with an 80% likelihood our method quickly reduces the lateral location error to within 5 m on a new KITTI cross-view dataset.
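The iterative pose search described in the abstract can be illustrated with a generic Levenberg-Marquardt loop. The following is a minimal sketch under strong assumptions and is not the authors' implementation: the pose is taken to be a planar (tx, ty, yaw) triple, plain 2D point coordinates stand in for the learned CNN feature maps, and a simple rigid transform stands in for the Geometry Projection module. All function and variable names (`residuals`, `jacobian`, `solve3`, `lm_refine`, `sat_pts`, `obs_pts`) are hypothetical.

```python
import math

def residuals(pose, sat_pts, obs_pts):
    """Differences between projected satellite points and observed points."""
    tx, ty, yaw = pose
    c, s = math.cos(yaw), math.sin(yaw)
    r = []
    for (x, y), (u, v) in zip(sat_pts, obs_pts):
        r.append(c * x - s * y + tx - u)
        r.append(s * x + c * y + ty - v)
    return r

def jacobian(pose, sat_pts):
    """Analytic Jacobian of the residuals with respect to (tx, ty, yaw)."""
    yaw = pose[2]
    c, s = math.cos(yaw), math.sin(yaw)
    J = []
    for x, y in sat_pts:
        J.append([1.0, 0.0, -s * x - c * y])
        J.append([0.0, 1.0, c * x - s * y])
    return J

def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with pivoting."""
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda i: abs(M[i][col]))
        M[col], M[piv] = M[piv], M[col]
        for i in range(col + 1, 3):
            f = M[i][col] / M[col][col]
            for j in range(col, 4):
                M[i][j] -= f * M[col][j]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, 3))
        x[i] = (M[i][3] - tail) / M[i][i]
    return x

def lm_refine(pose, sat_pts, obs_pts, iters=50, lam=1e-2):
    """Refine a coarse pose with damped Gauss-Newton (LM) steps."""
    pose = list(pose)
    cost = sum(e * e for e in residuals(pose, sat_pts, obs_pts))
    for _ in range(iters):
        r = residuals(pose, sat_pts, obs_pts)
        J = jacobian(pose, sat_pts)
        # Damped normal equations: (J^T J + lam * I) delta = -J^T r
        A = [[sum(row[a] * row[b] for row in J) + (lam if a == b else 0.0)
              for b in range(3)] for a in range(3)]
        g = [-sum(row[a] * ri for row, ri in zip(J, r)) for a in range(3)]
        delta = solve3(A, g)
        trial = [p + d for p, d in zip(pose, delta)]
        trial_cost = sum(e * e for e in residuals(trial, sat_pts, obs_pts))
        if trial_cost < cost:
            # Step reduced the error: accept it and relax the damping.
            pose, cost, lam = trial, trial_cost, lam * 0.5
        else:
            # Step made things worse: reject it and increase the damping.
            lam *= 10.0
    return pose, cost

# Toy data: observations are generated from a known (assumed) true pose.
true_pose = (3.0, -2.0, 0.2)
sat_pts = [(10.0, 0.0), (0.0, 10.0), (-10.0, 5.0), (5.0, -10.0), (-3.0, -7.0)]
ct, st = math.cos(true_pose[2]), math.sin(true_pose[2])
obs_pts = [(ct * x - st * y + true_pose[0], st * x + ct * y + true_pose[1])
           for x, y in sat_pts]

# Start from a coarse initial guess and refine it.
est, final_cost = lm_refine((0.0, 0.0, 0.0), sat_pts, obs_pts)
```

In this toy setting the refined `est` should land close to `true_pose`. In the pipeline the abstract describes, the residual would instead compare projected satellite features with observed ground-view features, and the LM steps would be differentiable so that the feature extractors can be trained end-to-end.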