Paper Title

RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning

Paper Authors

Xiaojian Ma, Weili Nie, Zhiding Yu, Huaizu Jiang, Chaowei Xiao, Yuke Zhu, Song-Chun Zhu, Anima Anandkumar

Paper Abstract

Reasoning about visual relationships is central to how humans interpret the visual world. This task remains challenging for current deep learning algorithms since it requires addressing three key technical problems jointly: 1) identifying object entities and their properties, 2) inferring semantic relations between pairs of entities, and 3) generalizing to novel object-relation combinations, i.e., systematic generalization. In this work, we use vision transformers (ViTs) as our base model for visual reasoning and make better use of concepts defined as object entities and their relations to improve the reasoning ability of ViTs. Specifically, we introduce a novel concept-feature dictionary to allow flexible image feature retrieval at training time with concept keys. This dictionary enables two new concept-guided auxiliary tasks: 1) a global task for promoting relational reasoning, and 2) a local task for facilitating semantic object-centric correspondence learning. To examine the systematic generalization of visual reasoning models, we introduce systematic splits for the standard HICO and GQA benchmarks. We show that the resulting model, the Concept-guided Vision Transformer (or RelViT for short), significantly outperforms prior approaches on HICO and GQA by 16% and 13% in the original split, and by 43% and 18% in the systematic split. Our ablation analyses also reveal our model's compatibility with multiple ViT variants and robustness to hyper-parameters.
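As a rough illustration of the concept-feature dictionary described in the abstract, the minimal Python sketch below keys a bounded feature bank by concept and retrieves, at training time, a stored image feature that shares the current sample's concept key, e.g., to form positive pairs for the concept-guided auxiliary tasks. The class name `ConceptFeatureDictionary`, the `max_per_concept` bound, and the random-sampling retrieval policy are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a concept-keyed feature dictionary (hypothetical API,
# not the paper's official code). Each concept key maps to a bounded FIFO of
# recent image features; retrieval by the same concept key can supply
# positives for the concept-guided auxiliary tasks described in the abstract.
from collections import defaultdict, deque
from typing import Optional
import random

import torch


class ConceptFeatureDictionary:
    def __init__(self, max_per_concept: int = 64):
        # One bounded queue of stored features per concept key.
        self.bank = defaultdict(lambda: deque(maxlen=max_per_concept))

    def update(self, concept: str, feature: torch.Tensor) -> None:
        # Store a detached copy so the bank does not retain the autograd graph.
        self.bank[concept].append(feature.detach().cpu())

    def retrieve(self, concept: str) -> Optional[torch.Tensor]:
        # Randomly sample a stored feature with the same concept key, if any.
        candidates = self.bank.get(concept)
        if not candidates:
            return None
        return random.choice(list(candidates))


# Toy usage: features of two images sharing the concept "ride bicycle" can be
# paired as positives for a global relational-reasoning objective.
dictionary = ConceptFeatureDictionary()
dictionary.update("ride bicycle", torch.randn(256))
positive = dictionary.retrieve("ride bicycle")
print(None if positive is None else positive.shape)  # torch.Size([256])
```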
