Paper Title
SA-VQA: Structured Alignment of Visual and Semantic Representations for Visual Question Answering
Paper Authors
Paper Abstract
Visual Question Answering (VQA) attracts much attention from both industry and academia. As a multi-modality task, it is challenging because it requires not only visual and textual understanding but also the ability to align cross-modality representations. Previous approaches extensively employ entity-level alignments, such as correlations between visual regions and their semantic labels, or interactions between question words and object features. These attempts aim to improve cross-modality representations while ignoring their internal relations. Instead, we propose to apply structured alignments, which work with graph representations of visual and textual content and aim to capture the deep connections between the visual and textual modalities. Nevertheless, it is nontrivial to represent and integrate graphs for structured alignment. In this work, we address this issue by first converting the entities of each modality into sequential nodes and an adjacency graph, then incorporating them for structured alignment. As demonstrated in our experimental results, such structured alignment improves reasoning performance. In addition, our model exhibits better interpretability for each generated answer. Without any pretraining, the proposed model outperforms the state-of-the-art methods on the GQA dataset and beats the non-pretrained state-of-the-art methods on the VQA-v2 dataset.
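
As a rough illustration of the mechanism the abstract describes (a minimal sketch, not the authors' released code), the snippet below converts visual-region and question-word features into one joint node sequence with a joint adjacency matrix, then propagates information only along its edges via adjacency-masked attention. All names here (StructuredAlignment, d_model, the toy inputs and edge patterns) are hypothetical assumptions for illustration.

# Minimal sketch of structured alignment over a joint cross-modal graph.
# Assumptions: intra-modal edges would come from a scene graph (visual) and a
# dependency parse (textual); cross-modal edges from entity-level alignments.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StructuredAlignment(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)

    def forward(self, nodes: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # nodes: (N, d) joint sequence of visual-region and question-word nodes
        # adj:   (N, N) joint adjacency matrix over both modalities
        scores = self.q(nodes) @ self.k(nodes).T / nodes.size(-1) ** 0.5
        scores = scores.masked_fill(adj == 0, float('-inf'))  # attend only along edges
        return F.softmax(scores, dim=-1) @ self.v(nodes)      # structure-aware features

# Toy usage: 4 visual nodes followed by 3 word nodes in one sequence.
d, n_vis, n_txt = 64, 4, 3
nodes = torch.randn(n_vis + n_txt, d)
adj = torch.eye(n_vis + n_txt)      # self-loops keep every softmax row finite
adj[:n_vis, :n_vis] = 1             # dense intra-visual edges (stand-in for a scene graph)
adj[n_vis:, n_vis:] = 1             # dense intra-textual edges (stand-in for a parse)
adj[0, n_vis] = adj[n_vis, 0] = 1   # one cross-modal alignment edge
out = StructuredAlignment(d)(nodes, adj)  # (7, 64) aligned node representations

Masking attention with the adjacency matrix is one common way to make message passing respect graph structure; the paper's actual integration of the two modality graphs may differ.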