Paper Title

MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-based Visual Question Answering

Authors

Ding, Yang; Yu, Jing; Liu, Bang; Hu, Yue; Cui, Mingxin; Wu, Qi

Abstract

Knowledge-based visual question answering requires the ability to associate external knowledge for open-ended cross-modal scene understanding. One limitation of existing solutions is that they capture relevant knowledge from text-only knowledge bases, which merely contain facts expressed by first-order predicates or language descriptions, while lacking the complex but indispensable multimodal knowledge needed for visual understanding. How to construct vision-relevant and explainable multimodal knowledge for the VQA scenario has been less studied. In this paper, we propose MuKEA, which represents multimodal knowledge as explicit triplets that correlate visual objects and fact answers through implicit relations. To bridge the heterogeneous gap, we propose three objective losses to learn the triplet representations from complementary views: embedding structure, topological relation, and semantic space. By adopting a pre-training and fine-tuning learning strategy, both basic and domain-specific multimodal knowledge are progressively accumulated for answer prediction. We outperform the state of the art by 3.35% and 6.08% respectively on two challenging knowledge-required datasets, OK-VQA and KRVQA. Experimental results demonstrate that the accumulated multimodal knowledge complements existing knowledge bases and that our end-to-end framework outperforms existing pipeline methods. The code is available at https://github.com/AndersonStra/MuKEA.
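
The abstract names three complementary loss views but does not spell out their formulations. As an illustration of the "embedding structure" view only, below is a minimal sketch of a TransE-style margin loss over (head, relation, tail) embeddings, where the head would encode a visual object, the tail a fact answer, and the relation is learned implicitly. The function name, margin value, and negative-sampling setup are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def triplet_embedding_loss(head, relation, tail, neg_tail, margin=1.0):
    """Hypothetical TransE-style embedding-structure loss.

    head, relation, tail, neg_tail: (batch, dim) embeddings, where
    neg_tail is a corrupted (wrong-answer) tail used as a negative.
    """
    # Positive triplet distance: ||h + r - t||_2
    pos = torch.norm(head + relation - tail, p=2, dim=-1)
    # Negative triplet distance with the corrupted tail
    neg = torch.norm(head + relation - neg_tail, p=2, dim=-1)
    # Margin ranking loss: positives should score lower than negatives
    return F.relu(margin + pos - neg).mean()

# Toy usage with random embeddings (batch of 4, dimension 128)
h, r, t, t_neg = (torch.randn(4, 128) for _ in range(4))
print(triplet_embedding_loss(h, r, t, t_neg))
```

In such a formulation, the relation vector is not read from a knowledge base but optimized end to end, which is one plausible reading of the paper's "implicit relations" between visual objects and fact answers.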
