Paper Title

Attention Guided Semantic Relationship Parsing for Visual Question Answering

Authors

Moshiur Farazi, Salman Khan, Nick Barnes

Abstract

Humans explain inter-object relationships with semantic labels that demonstrate a high-level understanding required to perform complex Vision-Language tasks such as Visual Question Answering (VQA). However, existing VQA models represent relationships as a combination of object-level visual features which constrain a model to express interactions between objects in a single domain, while the model is trying to solve a multi-modal task. In this paper, we propose a general purpose semantic relationship parser which generates a semantic feature vector for each subject-predicate-object triplet in an image, and a Mutual and Self Attention (MSA) mechanism that learns to identify relationship triplets that are important to answer the given question. To motivate the significance of semantic relationships, we show an oracle setting with ground-truth relationship triplets, where our model achieves a ~25% accuracy gain over the closest state-of-the-art model on the challenging GQA dataset. Further, with our semantic parser, we show that our model outperforms other comparable approaches on VQA and GQA datasets.
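
The abstract describes two components: a semantic relationship parser that produces one embedding per subject-predicate-object triplet, and a Mutual and Self Attention (MSA) mechanism that selects the triplets relevant to the question. Below is a minimal PyTorch sketch of that general idea, not the authors' implementation: the class name `TripletAttention`, the use of `nn.MultiheadAttention`, the feature dimensions, and the pooling choice are all illustrative assumptions.

```python
# Illustrative sketch only: question-guided attention over relationship-triplet
# embeddings, loosely following the self-attention + mutual-attention idea in the
# abstract. Not the authors' code; all names and dimensions are assumptions.
import torch
import torch.nn as nn


class TripletAttention(nn.Module):
    """Scores relationship triplets against a question and pools them."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        # Self-attention: triplets contextualise one another.
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Mutual (cross) attention: the question embedding queries the triplets.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, triplets: torch.Tensor, question: torch.Tensor) -> torch.Tensor:
        # triplets: (B, N, dim), one semantic feature per subject-predicate-object triplet
        # question: (B, dim), pooled question embedding
        ctx, _ = self.self_attn(triplets, triplets, triplets)   # (B, N, dim)
        query = question.unsqueeze(1)                           # (B, 1, dim)
        pooled, _ = self.cross_attn(query, ctx, ctx)            # (B, 1, dim)
        return pooled.squeeze(1)                                # question-relevant relation summary


if __name__ == "__main__":
    model = TripletAttention(dim=512)
    rel = torch.randn(2, 36, 512)    # e.g. 36 candidate relationship triplets per image
    ques = torch.randn(2, 512)
    print(model(rel, ques).shape)    # torch.Size([2, 512])
```

The pooled relation summary would then be fused with visual and question features in whatever answer classifier the surrounding VQA model uses; that fusion step is outside the scope of this sketch.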
