Paper Title

Attention Mechanism based Cognition-level Scene Understanding

Authors

Xuejiao Tang, Wenbin Zhang

Abstract

Given a question-image input, a Visual Commonsense Reasoning (VCR) model predicts an answer together with a supporting rationale, which requires real-world inference ability. The VCR task, which calls for exploiting multi-source information as well as learning different levels of understanding and extensive commonsense knowledge, is a cognition-level scene understanding task. The VCR task has attracted researchers' interest due to its wide range of applications, including visual question answering, automated vehicle systems, and clinical decision support. Previous approaches to the VCR task generally rely on pre-training or on memory with models that encode long-range dependencies. However, these approaches suffer from a lack of generalizability and from information loss over long sequences. In this paper, we propose PAVCR, a parallel attention-based cognitive VCR network that fuses visual-textual information efficiently and encodes semantic information in parallel, enabling the model to capture rich information for cognition-level inference. Extensive experiments show that the proposed model yields significant improvements over existing methods on the benchmark VCR dataset. Moreover, the proposed model provides an intuitive interpretation of visual commonsense reasoning.
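The abstract describes two ideas that lend themselves to a concrete illustration: cross-modal fusion of visual and textual features, and attention branches that run in parallel rather than in a long sequential stack. The sketch below is a minimal PyTorch illustration of parallel cross-attention fusion under those assumptions; the module name ParallelAttentionFusion, the feature dimensions, and the pooling and fusion choices are hypothetical and are not the paper's actual PAVCR architecture.

```python
# Minimal, illustrative sketch of parallel cross-attention fusion between
# visual and textual features, in the spirit of the abstract's description.
# NOT the paper's PAVCR implementation: all names, dimensions, and the
# fusion strategy here are assumptions for illustration only.
import torch
import torch.nn as nn

class ParallelAttentionFusion(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        # Two cross-attention branches run in parallel:
        # text attends to image regions, and image regions attend to text.
        self.text_to_visual = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.visual_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, text_feats, visual_feats):
        # text_feats:   (batch, num_tokens,  dim)  e.g. question/answer embeddings
        # visual_feats: (batch, num_regions, dim)  e.g. detected object features
        t2v, _ = self.text_to_visual(text_feats, visual_feats, visual_feats)
        v2t, _ = self.visual_to_text(visual_feats, text_feats, text_feats)
        # Pool each branch and concatenate, then project to a joint
        # representation that could feed answer/rationale prediction heads.
        pooled = torch.cat([t2v.mean(dim=1), v2t.mean(dim=1)], dim=-1)
        return self.fuse(pooled)

# Example usage with random features standing in for real encoders/detectors.
fusion = ParallelAttentionFusion()
text = torch.randn(2, 16, 512)    # 2 samples, 16 text tokens each
visual = torch.randn(2, 36, 512)  # 2 samples, 36 image regions each
joint = fusion(text, visual)      # (2, 512) fused representation
```

Running the two cross-attention branches side by side, rather than chaining them through one long sequential encoder, is one plausible way to avoid the long-sequence information loss the abstract attributes to prior approaches.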
