Dual-branch Hybrid Learning Network for Unbiased Scene Graph Generation

Paper Title

Dual-branch Hybrid Learning Network for Unbiased Scene Graph Generation

Authors

Chaofan Zheng, Lianli Gao, Xinyu Lyu, Pengpeng Zeng, Abdulmotaleb El Saddik, Heng Tao Shen

Abstract

Current studies of Scene Graph Generation (SGG) focus on solving the long-tailed problem in order to generate unbiased scene graphs. However, most de-biasing methods overemphasize the tail predicates and underestimate the head ones throughout training, thereby degrading the representation ability of head predicate features. Furthermore, these impaired head predicate features harm the learning of tail predicates. In fact, the inference of tail predicates depends heavily on the general patterns learned from head ones; e.g., "standing on" depends on "on". Thus, these de-biasing SGG methods achieve neither excellent performance on tail predicates nor satisfactory performance on head ones. To address this issue, we propose a Dual-branch Hybrid Learning network (DHL) that takes care of both head and tail predicates for SGG. It consists of a Coarse-grained Learning Branch (CLB) and a Fine-grained Learning Branch (FLB). Specifically, the CLB is responsible for learning expertise and robust features of head predicates, while the FLB is expected to predict informative tail predicates. Furthermore, DHL is equipped with a Branch Curriculum Schedule (BCS) to make the two branches work well together. Experiments show that our approach achieves new state-of-the-art performance on the Visual Genome (VG) and GQA datasets and strikes a trade-off between performance on tail predicates and on head ones. Moreover, extensive experiments on two downstream tasks (i.e., Image Captioning and Sentence-to-Graph Retrieval) further verify the generalization and practicability of our method.
