Paper Title
Dynamic Language Binding in Relational Visual Reasoning
Paper Authors
Paper Abstract
We present the Language-binding Object Graph Network, the first neural reasoning method with dynamic relational structures across both visual and textual domains, with applications in visual question answering. Relaxing the common assumption made by current models that object predicates pre-exist and remain static, passive to the reasoning process, we propose that these dynamic predicates extend across domain borders to include pair-wise visual-linguistic object bindings. In our method, these contextualized object links are actively discovered at each recurrent reasoning step without relying on external predicative priors. The dynamic structures reflect conditional dual-domain object dependencies given the evolving context of the reasoning, computed through co-attention. The discovered dynamic graphs facilitate multi-step knowledge combination and refinement, iteratively deducing a compact representation of the final answer. The effectiveness of the model is demonstrated on image question answering, achieving favorable performance on major VQA datasets. Our method outperforms other approaches on sophisticated question-answering tasks involving multiple object relations. The graph structure also effectively assists training, allowing the network to learn efficiently compared with other reasoning models.
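The core mechanism the abstract describes, pair-wise visual-linguistic object binding conditioned on an evolving reasoning context, can be illustrated with a minimal numpy sketch. This is not the authors' implementation: the function name `bind_objects`, the multiplicative context conditioning, and the scaled dot-product co-attention are illustrative assumptions standing in for the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bind_objects(visual, words, context):
    """One illustrative recurrent step of contextualized object binding.

    visual : (N, d) visual object features
    words  : (M, d) linguistic (word) object features
    context: (d,)   evolving reasoning-context vector

    Returns the language-bound visual objects (N, d) and the
    dynamic cross-domain attention graph (N, M).
    """
    # Condition both domains on the current reasoning context
    # (illustrative choice: element-wise gating).
    v = visual * context                       # (N, d)
    w = words * context                        # (M, d)
    # Pair-wise visual-linguistic affinity scores (co-attention).
    scores = v @ w.T / np.sqrt(v.shape[-1])    # (N, M)
    attn = softmax(scores, axis=-1)
    # Bind each visual object to a context-weighted mixture of words.
    bound = visual + attn @ words              # (N, d)
    return bound, attn
```

Iterating this step, with the context vector updated from the question between steps, yields the multi-step refinement the abstract refers to: the attention map `attn` is the dynamic graph rediscovered at every reasoning step.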