Paper Title

Multi-Modal Reasoning Graph for Scene-Text Based Fine-Grained Image Classification and Retrieval

Paper Authors

Andres Mafla, Sounak Dey, Ali Furkan Biten, Lluis Gomez, Dimosthenis Karatzas

Paper Abstract

Scene text instances found in natural images carry explicit semantic information that can provide important cues to solve a wide array of computer vision problems. In this paper, we focus on leveraging multi-modal content in the form of visual and textual cues to tackle the task of fine-grained image classification and retrieval. First, we obtain the text instances from images by employing a text reading system. Then, we combine textual features with salient image regions to exploit the complementary information carried by the two sources. Specifically, we employ a Graph Convolutional Network to perform multi-modal reasoning and obtain relationship-enhanced features by learning a common semantic space between salient objects and text found in an image. By obtaining an enhanced set of visual and textual features, the proposed model greatly outperforms the previous state-of-the-art in two different tasks, fine-grained classification and image retrieval in the Con-Text and Drink Bottle datasets.
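The core idea above — stacking salient-region features and scene-text features as nodes of one graph and running a Graph Convolutional Network over them to obtain relationship-enhanced features — can be sketched as follows. This is a minimal illustration, not the authors' implementation; the feature dimensions, the fully connected adjacency, and all variable names are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical inputs: 5 salient image regions and 3 detected text
# instances, each already projected to a common 16-d semantic space.
visual_feats = rng.standard_normal((5, 16))
text_feats = rng.standard_normal((3, 16))

# Stack both modalities as the nodes of a single multi-modal graph.
H = np.vstack([visual_feats, text_feats])   # node features, shape (8, 16)
N = H.shape[0]

# Fully connected graph (every region attends to every text instance and
# vice versa), symmetrically normalized: A_hat = D^{-1/2} A D^{-1/2}.
A = np.ones((N, N))
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
A_hat = D_inv_sqrt @ A @ D_inv_sqrt

# One GCN layer: relationship-enhanced features H' = ReLU(A_hat @ H @ W).
W = 0.1 * rng.standard_normal((16, 16))     # learnable weights in practice
H_enhanced = np.maximum(0.0, A_hat @ H @ W)

print(H_enhanced.shape)  # (8, 16): one enhanced feature per node
```

The enhanced node features would then feed a classification head (fine-grained classes) or be pooled into an embedding for retrieval; in a real model `W` is trained end-to-end and the adjacency may be learned rather than fixed.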
