Paper Title

Multimodal Fusion Transformer for Remote Sensing Image Classification

Authors

Swalpa Kumar Roy, Ankur Deria, Danfeng Hong, Behnood Rasti, Antonio Plaza, Jocelyn Chanussot

Abstract

Vision transformers (ViTs) have been trending in image classification tasks due to their promising performance when compared to convolutional neural networks (CNNs). As a result, many researchers have tried to incorporate ViTs into hyperspectral image (HSI) classification tasks. To achieve satisfactory performance, close to that of CNNs, transformers need fewer parameters. ViTs and other similar transformers use an external classification (CLS) token, which is randomly initialized and often fails to generalize well, whereas other sources of multimodal data, such as light detection and ranging (LiDAR), offer the potential to improve these models by means of a CLS token. In this paper, we introduce a new multimodal fusion transformer (MFT) network, which comprises a multihead cross patch attention (mCrossPA) mechanism for HSI land-cover classification. Our mCrossPA utilizes other sources of complementary information, in addition to the HSI, in the transformer encoder to achieve better generalization. The concept of tokenization is used to generate CLS and HSI patch tokens, helping to learn a distinctive representation in a reduced and hierarchical feature space. Extensive experiments are carried out on widely used benchmark datasets, i.e., the University of Houston, Trento, University of Southern Mississippi Gulfpark (MUUFL), and Augsburg. We compare the results of the proposed MFT model with those of other state-of-the-art transformers, classical CNNs, and conventional classifier models. The superior performance achieved by the proposed model is due to the use of multihead cross patch attention. The source code will be made available publicly at \url{https://github.com/AnkurDeria/MFT}.
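
The central idea of the abstract — replacing the randomly initialized CLS token with one derived from a complementary modality (e.g., LiDAR) and letting it attend to the tokenized HSI patches — can be illustrated with a short sketch. The code below is an assumption-laden illustration, not the authors' implementation: the class name `MultiheadCrossPatchAttention`, the reliance on PyTorch's `nn.MultiheadAttention`, and all dimensions are hypothetical placeholders; the actual mCrossPA code is in the repository linked above.

```python
import torch
import torch.nn as nn

class MultiheadCrossPatchAttention(nn.Module):
    """Illustrative sketch: a CLS token derived from a complementary
    modality (e.g., LiDAR) attends to tokenized HSI patches, fusing
    the two sources. Not the authors' implementation of mCrossPA."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, cls_token: torch.Tensor, hsi_tokens: torch.Tensor) -> torch.Tensor:
        # cls_token:  (B, 1, dim) -- embedded from the auxiliary modality
        # hsi_tokens: (B, N, dim) -- N tokenized HSI patch embeddings
        fused, _ = self.attn(query=cls_token, key=hsi_tokens, value=hsi_tokens)
        # residual connection + layer norm, as in a standard transformer block
        return self.norm(cls_token + fused)

# Hypothetical shapes for illustration only.
batch, n_patches, dim = 8, 64, 64
lidar_cls = torch.randn(batch, 1, dim)           # stands in for a learned LiDAR embedding
hsi_tokens = torch.randn(batch, n_patches, dim)  # stands in for tokenized HSI patches
fused_cls = MultiheadCrossPatchAttention(dim)(lidar_cls, hsi_tokens)
print(fused_cls.shape)  # torch.Size([8, 1, 64])
```

The design point this sketch captures is that the query comes from the auxiliary modality while the keys and values come from the HSI patch tokens, so the fused CLS token summarizes the HSI patch set conditioned on the complementary (e.g., LiDAR) information, rather than starting from a random initialization.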
