Paper Title
SceneGATE: Scene-Graph based co-Attention networks for TExt visual question answering
Paper Authors
Paper Abstract
Most TextVQA approaches focus on integrating objects, scene texts, and question words with a simple transformer encoder, which fails to capture the semantic relations between different modalities. This paper proposes a Scene Graph based co-Attention Network (SceneGATE) for TextVQA, which reveals the semantic relations among objects, Optical Character Recognition (OCR) tokens, and question words. It does so by building a TextVQA-based scene graph that discovers the underlying semantics of an image. We create a guided-attention module to capture the intra-modal interplay within language and vision as a guidance for inter-modal interactions. To model the relations between the two modalities explicitly, we propose and integrate two attention modules, namely a scene graph-based semantic relation-aware attention and a positional relation-aware attention. We conduct extensive experiments on two benchmark datasets, Text-VQA and ST-VQA. The results show that our SceneGATE method outperforms existing approaches thanks to the scene graph and its attention modules.
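The abstract describes relation-aware attention modules driven by a scene graph. One common way to realize such a module is to bias the attention logits with a learned embedding of the pairwise relation label between tokens (objects, OCR tokens, question words). The sketch below is a minimal illustration of that idea, assuming a standard scaled dot-product formulation; it is not the authors' released implementation, and names such as `RelationAwareAttention` and `rel_ids` are hypothetical.

```python
# Minimal sketch of relation-aware attention: scaled dot-product attention whose
# logits receive an additive bias looked up from the relation label of each
# token pair (e.g., a scene-graph edge type or a discretized spatial relation).
# This is an assumed formulation, not the paper's exact module.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelationAwareAttention(nn.Module):
    def __init__(self, dim: int, num_relations: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # One learned scalar bias per relation type.
        self.rel_bias = nn.Embedding(num_relations, 1)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor, rel_ids: torch.Tensor) -> torch.Tensor:
        # x:       (batch, seq, dim)  fused object / OCR / question features
        # rel_ids: (batch, seq, seq)  integer relation label for each token pair
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        logits = torch.matmul(q, k.transpose(-1, -2)) * self.scale
        logits = logits + self.rel_bias(rel_ids).squeeze(-1)  # add relation bias
        attn = F.softmax(logits, dim=-1)
        return torch.matmul(attn, v)


# Tiny usage example with random features and relation labels.
layer = RelationAwareAttention(dim=64, num_relations=8)
x = torch.randn(2, 10, 64)
rel_ids = torch.randint(0, 8, (2, 10, 10))
out = layer(x, rel_ids)  # -> shape (2, 10, 64)
```

The same pattern can serve both modules mentioned in the abstract: semantic relation-aware attention would draw `rel_ids` from scene-graph edge labels, while positional relation-aware attention would draw them from discretized spatial relations between bounding boxes.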