Paper Title
Weakly Supervised Visual Semantic Parsing
Paper Authors
Paper Abstract
Scene Graph Generation (SGG) aims to extract entities, predicates and their semantic structure from images, enabling deep understanding of visual content, with many applications such as visual reasoning and image retrieval. Nevertheless, existing SGG methods require millions of manually annotated bounding boxes for training, and are computationally inefficient, as they exhaustively process all pairs of object proposals to detect predicates. In this paper, we address those two limitations by first proposing a generalized formulation of SGG, namely Visual Semantic Parsing, which disentangles entity and predicate recognition, and enables sub-quadratic performance. Then we propose the Visual Semantic Parsing Network, VSPNet, based on a dynamic, attention-based, bipartite message passing framework that jointly infers graph nodes and edges through an iterative process. Additionally, we propose the first graph-based weakly supervised learning framework, based on a novel graph alignment algorithm, which enables training without bounding box annotations. Through extensive experiments, we show that VSPNet outperforms weakly supervised baselines significantly and approaches fully supervised performance, while being several times faster. We publicly release the source code of our method.
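The bipartite message passing idea can be illustrated with a minimal sketch. The module below is a hypothetical simplification, not the released VSPNet code: the class name, dot-product attention, and GRU-gated updates are our assumptions. It shows the core mechanism the abstract describes: entity and predicate node states exchange messages through a soft adjacency matrix that is re-estimated from the current states at every iteration, so graph nodes and edges are inferred jointly.

```python
import torch
import torch.nn as nn

class BipartiteMessagePassing(nn.Module):
    """One round of attention-based message passing between entity
    and predicate nodes of a bipartite scene graph (illustrative sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.ent_to_pred = nn.Linear(dim, dim)   # messages: entities -> predicates
        self.pred_to_ent = nn.Linear(dim, dim)   # messages: predicates -> entities
        self.ent_update = nn.GRUCell(dim, dim)   # gated update of entity states
        self.pred_update = nn.GRUCell(dim, dim)  # gated update of predicate states

    def forward(self, ent, pred):
        # ent:  [num_entities, dim]   entity node states
        # pred: [num_predicates, dim] predicate node states
        # Soft (attention) adjacency between the two node sets,
        # re-estimated from the current states at every iteration.
        attn = torch.softmax(pred @ ent.t() / ent.size(1) ** 0.5, dim=-1)
        # Each predicate aggregates messages from all entities ...
        pred_in = attn @ self.ent_to_pred(ent)
        # ... and each entity from all predicates (transposed attention).
        ent_in = attn.t() @ self.pred_to_ent(pred)
        # Gated updates keep the iterative refinement stable.
        return self.ent_update(ent_in, ent), self.pred_update(pred_in, pred)
```

Because the number of predicate nodes is fixed rather than one per entity pair, each round costs O(num_entities * num_predicates) instead of the quadratic cost of scoring every proposal pair, which is the source of the sub-quadratic behavior claimed above.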
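The graph alignment step for weak supervision can likewise be sketched as a bipartite assignment problem. The function below is a hedged illustration only: it aligns predicted nodes to ground-truth labels using node classification scores alone, whereas the paper's actual algorithm also has to respect edge structure; the name `align_graphs` and its signature are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_graphs(pred_node_scores, gt_labels):
    """Align predicted graph nodes to unlocalized ground-truth labels by
    solving a bipartite assignment (no bounding boxes required).

    pred_node_scores: [num_pred_nodes, num_classes] class probabilities
    gt_labels:        [num_gt_nodes] ground-truth class indices
    Returns (pred_idx, gt_idx) index pairs maximizing total score.
    """
    # Cost of matching predicted node i to ground-truth node j is the
    # negative probability that node i assigns to j's class.
    cost = -pred_node_scores[:, gt_labels]          # [num_pred, num_gt]
    pred_idx, gt_idx = linear_sum_assignment(cost)  # Hungarian algorithm
    return pred_idx, gt_idx
```

Once such an alignment is found, the matched ground-truth classes can serve as classification targets for the predicted nodes, which is what makes training without box annotations possible.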