Paper Title
Multi-Modal Fusion Transformer for Visual Question Answering in Remote Sensing
Paper Authors
Paper Abstract
With the new generation of satellite technologies, the archives of remote sensing (RS) images are growing very fast. To make the intrinsic information of each RS image easily accessible, visual question answering (VQA) has been introduced in RS. VQA allows a user to formulate a free-form question concerning the content of RS images to extract generic information. It has been shown that the fusion of the input modalities (i.e., image and text) is crucial for the performance of VQA systems. Most of the current fusion approaches use modality-specific representations in their fusion modules instead of joint representation learning. However, to discover the underlying relation between both the image and question modality, the model is required to learn the joint representation instead of simply combining (e.g., concatenating, adding, or multiplying) the modality-specific representations. We propose a multi-modal transformer-based architecture to overcome this issue. Our proposed architecture consists of three main modules: i) the feature extraction module for extracting the modality-specific features; ii) the fusion module, which leverages a user-defined number of multi-modal transformer layers of the VisualBERT model (VB); and iii) the classification module to obtain the answer. Experimental results obtained on the RSVQAxBEN and RSVQA-LR datasets (which are made up of RGB bands of Sentinel-2 images) demonstrate the effectiveness of VBFusion for VQA tasks in RS. To analyze the importance of using other spectral bands for the description of the complex content of RS images in the framework of VQA, we extend the RSVQAxBEN dataset to include all the spectral bands of Sentinel-2 images with 10m and 20m spatial resolution.
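The abstract contrasts simple combination of modality-specific features (e.g., concatenation) with joint representation learning, where a multi-modal transformer layer lets image and question tokens attend to each other. The following is a minimal, illustrative NumPy sketch of that distinction; the dimensions, random features, and single-head attention are assumptions for illustration only and do not reproduce the paper's VBFusion architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                  # feature dimension (illustrative choice)
img = rng.normal(size=(4, d))          # 4 image-region features (hypothetical)
txt = rng.normal(size=(6, d))          # 6 question-token features (hypothetical)

# Naive fusion: pool each modality separately, then concatenate the
# two modality-specific vectors. No cross-modal interaction is learned.
naive = np.concatenate([img.mean(axis=0), txt.mean(axis=0)])  # shape (2*d,)

def self_attention(x, wq, wk, wv):
    """One single-head scaled dot-product self-attention step."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(x.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over keys
    return weights @ v

# Joint fusion: stack the tokens of both modalities into one sequence and
# apply self-attention, so every image token can attend to every question
# token and vice versa -- the core idea behind a multi-modal transformer layer.
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
tokens = np.vstack([img, txt])                            # (10, d) joint sequence
joint = self_attention(tokens, wq, wk, wv).mean(axis=0)   # pooled joint feature

print(naive.shape, joint.shape)  # -> (16,) (8,)
```

The naive vector keeps the two modalities in separate halves, whereas each element of the joint vector mixes information from both modalities through the attention weights.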