Paper Title
Visually Grounded VQA by Lattice-based Retrieval
Paper Authors
Paper Abstract
Visual Grounding (VG) in Visual Question Answering (VQA) systems describes how well a system manages to tie a question and its answer to relevant image regions. Systems with strong VG are considered intuitively interpretable and suggest an improved scene understanding. While VQA accuracy has seen impressive gains over the past few years, explicit improvements to VG performance, and its evaluation, have often taken a back seat on the road to overall accuracy improvements. One cause of this lies in the predominant choice of learning paradigm for VQA systems, which consists of training a discriminative classifier over a predetermined set of answer options. In this work, we break with the dominant VQA modeling paradigm of classification and investigate VQA from the standpoint of an information retrieval task. As such, the developed system directly ties VG into its core search procedure. Our system operates over a weighted, directed, acyclic graph, a.k.a. "lattice", which is derived from the scene graph of a given image in conjunction with region-referring expressions extracted from the question. We give a detailed analysis of our approach and discuss its distinctive properties and limitations. Our approach achieves the strongest VG performance among examined systems and exhibits exceptional generalization capabilities in a number of scenarios.
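To make the core idea of the abstract concrete, the following is a minimal, purely illustrative sketch of retrieval over a weighted, directed, acyclic graph ("lattice"): answering becomes a best-path search rather than classification. All node names, edge weights, and the additive scoring scheme here are assumptions for illustration, not the paper's actual construction.

```python
# Toy "lattice" retrieval sketch (assumed setup, not the paper's method):
# nodes stand in for region-referring expressions matched against
# scene-graph regions, and edge weights mimic match confidences.
from collections import defaultdict


def best_path(edges, start, goal):
    """Return (score, path) for the highest-weight start->goal path in a DAG.

    edges: list of (src, dst, weight) tuples forming an acyclic graph.
    """
    adj = defaultdict(list)
    for src, dst, w in edges:
        adj[src].append((dst, w))

    memo = {}  # node -> best (score, path) from this node to goal, or None

    def visit(node):
        if node in memo:
            return memo[node]
        if node == goal:
            memo[node] = (0.0, [goal])
            return memo[node]
        candidate = None
        for dst, w in adj[node]:
            sub = visit(dst)
            if sub is None:  # dead end: no route to the goal from dst
                continue
            score = w + sub[0]
            if candidate is None or score > candidate[0]:
                candidate = (score, [node] + sub[1])
        memo[node] = candidate
        return candidate

    return visit(start)


# Hypothetical lattice for a question like "What is the dog on?":
edges = [
    ("<start>", "dog", 0.9),
    ("<start>", "cat", 0.4),
    ("dog", "on-sofa", 0.8),
    ("cat", "on-sofa", 0.7),
    ("on-sofa", "<answer>", 1.0),
]
score, path = best_path(edges, "<start>", "<answer>")
print(score, path)  # the winning path passes through "dog", grounding the answer
```

The point of the sketch is that the returned path itself names the image regions the answer rests on, which is why retrieval over such a lattice ties VG directly into the search procedure.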