Paper Title

Stacked Hybrid-Attention and Group Collaborative Learning for Unbiased Scene Graph Generation

Paper Authors

Xingning Dong, Tian Gan, Xuemeng Song, Jianlong Wu, Yuan Cheng, Liqiang Nie

Abstract

Scene Graph Generation (SGG), which generally follows a regular encoder-decoder pipeline, aims to first encode the visual contents within the given image and then parse them into a compact summary graph. Existing SGG approaches generally not only neglect the insufficient modality fusion between vision and language, but also fail to provide informative predicates due to biased relationship predictions, leaving SGG far from practical. Towards this end, in this paper, we first present a novel Stacked Hybrid-Attention network, which facilitates intra-modal refinement as well as inter-modal interaction, to serve as the encoder. We then devise an innovative Group Collaborative Learning strategy to optimize the decoder. In particular, based on the observation that the recognition capability of a single classifier is limited on an extremely unbalanced dataset, we first deploy a group of classifiers that are experts at distinguishing different subsets of classes, and then cooperatively optimize them from two aspects to promote unbiased SGG. Experiments conducted on the VG and GQA datasets demonstrate that we not only establish a new state-of-the-art in the unbiased metric, but also nearly double the performance compared with two baselines.
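The group-of-classifiers idea in the abstract can be illustrated with a minimal sketch: split the long-tailed predicate classes by frequency into a few groups, so that each per-group classifier faces a less skewed label distribution than a single classifier over all classes would. The helper below and its toy counts are illustrative assumptions only, not the paper's actual grouping scheme.

```python
from typing import Dict, List

def split_into_balanced_groups(class_counts: Dict[str, int],
                               num_groups: int) -> List[List[str]]:
    """Sort predicate classes by sample count (descending) and slice them
    into contiguous groups. Each group spans a narrower frequency range,
    so a classifier trained per group sees a more balanced subset."""
    ranked = sorted(class_counts, key=class_counts.get, reverse=True)
    size = -(-len(ranked) // num_groups)  # ceiling division
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]

# Toy long-tailed predicate distribution (made-up numbers).
counts = {"on": 5000, "has": 3000, "near": 800,
          "holding": 200, "eating": 50, "riding": 20}
groups = split_into_balanced_groups(counts, 3)
# One classifier would then be trained per group and the group
# predictions combined, rather than one classifier over all classes.
```

In the actual method, the groups are optimized cooperatively rather than independently; this sketch only shows the frequency-based division step.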
