Paper Title
Interpretable Neural Computation for Real-World Compositional Visual Question Answering
Paper Authors
Paper Abstract
There are two main lines of research on visual question answering (VQA): compositional models with explicit multi-hop reasoning, and monolithic networks with implicit reasoning in the latent feature space. The former excel in interpretability and compositionality but fail on real-world images, while the latter usually achieve better performance due to model flexibility and parameter efficiency. We aim to combine the two to build an interpretable framework for real-world compositional VQA. In our framework, images and questions are disentangled into scene graphs and programs, and a symbolic program executor runs on them with full transparency to select the attention regions, which are then iteratively passed to a visual-linguistic pre-trained encoder to predict answers. Experiments conducted on the GQA benchmark demonstrate that our framework outperforms prior compositional models and achieves competitive accuracy among monolithic ones. With respect to the validity, plausibility, and distribution metrics, our framework surpasses the others by a considerable margin.
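The abstract describes a pipeline in which a symbolic executor runs a parsed program over a scene graph to select attention regions, which are then handed to a visual-linguistic encoder. Below is a minimal, hypothetical Python sketch of such an executor for illustration only; the class names, program format, and operation semantics (select / filter / relate) are assumptions and not the authors' implementation.

```python
# Hypothetical sketch (not the authors' code): a transparent symbolic executor
# that runs a functional program over a scene graph and returns the bounding
# boxes of the selected objects, i.e. the attention regions for the encoder.

class SceneGraph:
    def __init__(self, objects, relations):
        # objects: list of {"name": str, "attributes": set, "bbox": tuple};
        # object ids are simply indices into this list.
        self.objects = objects
        # relations: list of (subject_id, predicate, object_id) triples.
        self.relations = relations


class ProgramExecutor:
    """Runs simple functional programs (select / filter / relate) step by step."""

    def __init__(self, graph):
        self.graph = graph

    def run(self, program):
        ids = set()
        for op, arg in program:
            if op == "select":      # objects whose category matches arg
                ids = {i for i, o in enumerate(self.graph.objects)
                       if o["name"] == arg}
            elif op == "filter":    # keep objects carrying attribute arg
                ids = {i for i in ids
                       if arg in self.graph.objects[i]["attributes"]}
            elif op == "relate":    # subjects related by predicate arg to the current set
                ids = {s for s, p, o in self.graph.relations
                       if o in ids and p == arg}
        # The selected bounding boxes act as attention regions; the
        # visual-linguistic encoder that consumes them is not shown here.
        return [self.graph.objects[i]["bbox"] for i in ids]


# Usage: "What color is the cup on the table?" might parse into the program below.
graph = SceneGraph(
    objects=[{"name": "cup", "attributes": {"white"}, "bbox": (10, 10, 40, 40)},
             {"name": "table", "attributes": {"wooden"}, "bbox": (0, 30, 200, 120)}],
    relations=[(0, "on", 1)],
)
regions = ProgramExecutor(graph).run([("select", "table"), ("relate", "on")])
print(regions)  # -> [(10, 10, 40, 40)]
```

Every intermediate object set is inspectable, which is what makes the reasoning trace fully transparent; the final answer prediction would still come from the pre-trained encoder attending to the returned regions.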