Paper Title
SceneGATE: Scene-Graph based co-Attention networks for TExt visual question answering
Paper Authors
Paper Abstract
Most TextVQA approaches focus on integrating objects, scene texts, and question words with a simple transformer encoder, which fails to capture the semantic relations between different modalities. This paper proposes a Scene Graph based co-Attention Network (SceneGATE) for TextVQA, which reveals the semantic relations among objects, Optical Character Recognition (OCR) tokens, and question words. It does so by building a TextVQA-based scene graph that discovers the underlying semantics of an image. We create a guided-attention module to capture the intra-modal interplay within language and vision as a guidance for inter-modal interactions. To model the relations between the two modalities explicitly, we propose and integrate two attention modules, namely a scene graph-based semantic relation-aware attention and a positional relation-aware attention. We conduct extensive experiments on two benchmark datasets, Text-VQA and ST-VQA. The results show that our SceneGATE method outperforms existing approaches thanks to the scene graph and its attention modules.
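The abstract describes relation-aware attention modules driven by a scene graph. One common way to realize such a module is to bias the attention logits with a learned embedding of the pairwise relation label between tokens (objects, OCR tokens, question words). The sketch below is a minimal illustration of that idea, assuming a standard scaled dot-product formulation; it is not the authors' released implementation, and names such as `RelationAwareAttention` and `rel_ids` are hypothetical.

```python
# Minimal sketch of relation-aware attention: scaled dot-product attention whose
# logits receive an additive bias looked up from the relation label of each
# token pair (e.g., a scene-graph edge type or a discretized spatial relation).
# This is an assumed formulation, not the paper's exact module.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelationAwareAttention(nn.Module):
    def __init__(self, dim: int, num_relations: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # One learned scalar bias per relation type.
        self.rel_bias = nn.Embedding(num_relations, 1)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor, rel_ids: torch.Tensor) -> torch.Tensor:
        # x:       (batch, seq, dim)  fused object / OCR / question features
        # rel_ids: (batch, seq, seq)  integer relation label for each token pair
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        logits = torch.matmul(q, k.transpose(-1, -2)) * self.scale
        logits = logits + self.rel_bias(rel_ids).squeeze(-1)  # add relation bias
        attn = F.softmax(logits, dim=-1)
        return torch.matmul(attn, v)


# Tiny usage example with random features and relation labels.
layer = RelationAwareAttention(dim=64, num_relations=8)
x = torch.randn(2, 10, 64)
rel_ids = torch.randint(0, 8, (2, 10, 10))
out = layer(x, rel_ids)  # -> shape (2, 10, 64)
```

The same pattern can serve both modules mentioned in the abstract: semantic relation-aware attention would draw `rel_ids` from scene-graph edge labels, while positional relation-aware attention would draw them from discretized spatial relations between bounding boxes.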