Scene Graph Reasoning for Visual Question Answering

Paper Title

Scene Graph Reasoning for Visual Question Answering

Authors

Marcel Hildebrandt, Hang Li, Rajat Koner, Volker Tresp, Stephan Günnemann

Abstract

Visual question answering is concerned with answering free-form questions about an image. Since it requires a deep linguistic understanding of the question and the ability to associate it with various objects that are present in the image, it is an ambitious task and requires techniques from both computer vision and natural language processing. We propose a novel method that approaches the task by performing context-driven, sequential reasoning based on the objects and their semantic and spatial relationships present in the scene. As a first step, we derive a scene graph which describes the objects in the image, as well as their attributes and their mutual relationships. A reinforcement agent then learns to autonomously navigate over the extracted scene graph to generate paths, which are then the basis for deriving answers. We conduct a first experimental study on the challenging GQA dataset with manually curated scene graphs, where our method almost reaches the level of human performance.
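
To make the idea of deriving answers from paths over a scene graph more concrete, here is a minimal sketch. It is not the authors' implementation: the abstract describes a reinforcement agent that learns which edges to follow, whereas this sketch simply enumerates every short path from a start object. The graph contents, the names `toy_graph`, `enumerate_paths`, and `answer_from_path`, and the rule that the answer is the last node on a path are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical scene graph: each node maps to its outgoing (relation, neighbour) edges.
SceneGraph = Dict[str, List[Tuple[str, str]]]

toy_graph: SceneGraph = {
    "man": [("wearing", "hat"), ("holding", "cup")],
    "hat": [("has_attribute", "red")],
    "cup": [("on", "table")],
}


def enumerate_paths(graph: SceneGraph, start: str, max_hops: int) -> List[List[str]]:
    """Return all paths with at most max_hops edges that begin at start.

    A path alternates nodes and relations: [node, relation, node, ...].
    Exhaustive enumeration stands in for the learned navigation policy.
    """
    paths = [[start]]
    frontier = [[start]]
    for _ in range(max_hops):
        next_frontier = []
        for path in frontier:
            visited = path[::2]  # even positions hold node names
            for relation, neighbour in graph.get(path[-1], []):
                if neighbour in visited:
                    continue  # do not revisit a node on the same path
                next_frontier.append(path + [relation, neighbour])
        paths.extend(next_frontier)
        frontier = next_frontier
    return paths


def answer_from_path(path: List[str]) -> str:
    """Assumed answer rule for this sketch: the final node reached by the path."""
    return path[-1]


paths = enumerate_paths(toy_graph, "man", max_hops=2)
hat_colour_path = ["man", "wearing", "hat", "has_attribute", "red"]
```

For a question such as "What colour is the hat the man is wearing?", a trained agent would be expected to prefer a path like `hat_colour_path`, from which `answer_from_path` reads off `red`. In the sketch, choosing among the enumerated paths is left open, since that selection is exactly what the learned policy provides.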
