Transformer-Guided Convolutional Neural Network for Cross-View Geolocalization

Paper Title

Transformer-Guided Convolutional Neural Network for Cross-View Geolocalization

Authors

Teng Wang, Shujuan Fan, Daikun Liu, Changyin Sun

Abstract

Ground-to-aerial geolocalization localizes a ground-level query image by matching it against a reference database of geo-tagged aerial imagery. The task is very challenging because the two views differ greatly in visual appearance and geometric configuration. In this work, we propose a novel Transformer-guided convolutional neural network (TransGCNN) architecture, which couples CNN-based local features with Transformer-based global representations for enhanced representation learning. Specifically, TransGCNN consists of a CNN backbone that extracts a feature map from an input image and a Transformer head that models global context from the CNN map. In particular, the Transformer head acts as a spatial-aware importance generator that selects salient CNN features as the final feature representation. This coupling allows a lightweight Transformer network to greatly enhance the discriminative capability of the embedded features. Furthermore, we design a dual-branch Transformer head network that combines image features from multi-scale windows to improve the details of the global feature representation. Extensive experiments on popular benchmark datasets demonstrate that our model achieves top-1 accuracy of 94.12% and 84.92% on CVUSA and CVACT_val, respectively, outperforming the second-best baseline with fewer than 50% of its parameters and an almost 2x higher frame rate, and therefore achieving a preferable accuracy-efficiency tradeoff.
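The idea of a Transformer head acting as a spatial-aware importance generator over CNN features can be illustrated with a small, dependency-free sketch. The abstract does not give the actual layer design, so everything below is an assumption made for illustration: a single self-attention head, random projection weights (`w_q`, `w_k`, `w_v`, `w_score`), a softmax over locations, and weighted-sum pooling. It also omits the dual-branch, multi-scale part entirely, and should not be read as the TransGCNN implementation.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project(vector, weight):
    """Apply a linear map given as a list of output rows."""
    return [dot(row, vector) for row in weight]


def random_matrix(rows, cols, rng, scale=0.5):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def importance_map(tokens, w_q, w_k, w_v, w_score):
    """Self-attention over all locations, then one normalized weight per location."""
    queries = [project(t, w_q) for t in tokens]
    keys = [project(t, w_k) for t in tokens]
    values = [project(t, w_v) for t in tokens]
    scale = math.sqrt(len(keys[0]))
    scores = []
    for q in queries:
        attention = softmax([dot(q, k) / scale for k in keys])
        # Global context for this location: attention-weighted mix of all values.
        context = [sum(a * v[d] for a, v in zip(attention, values))
                   for d in range(len(values[0]))]
        scores.append(dot(w_score, context))
    return softmax(scores)


def weighted_pool(tokens, weights):
    """Aggregate CNN features into one descriptor using the importance weights."""
    channels = len(tokens[0])
    return [sum(w * t[c] for w, t in zip(weights, tokens)) for c in range(channels)]


rng = random.Random(0)
height, width, channels, head_dim = 4, 4, 8, 6

# Stand-in for a CNN backbone output (assumed shapes): an H x W grid of
# C-dimensional features, flattened to H*W tokens.
tokens = [[rng.gauss(0.0, 1.0) for _ in range(channels)]
          for _ in range(height * width)]

# Hypothetical, randomly initialized head parameters (learned in a real model).
w_q = random_matrix(head_dim, channels, rng)
w_k = random_matrix(head_dim, channels, rng)
w_v = random_matrix(head_dim, channels, rng)
w_score = [rng.gauss(0.0, 0.5) for _ in range(head_dim)]

weights = importance_map(tokens, w_q, w_k, w_v, w_score)
descriptor = weighted_pool(tokens, weights)
print(len(weights), len(descriptor))
```

In this sketch the final descriptor is built only from CNN features, while the attention head contributes just the per-location weights. That is one plausible reading of how a lightweight head can guide a CNN without replacing it.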
