Paper Title
Learning Visual Commonsense for Robust Scene Graph Generation
Paper Authors
Paper Abstract
Scene graph generation models understand the scene through object and predicate recognition, but are prone to mistakes due to the challenges of perception in the wild. Perception errors often lead to nonsensical compositions in the output scene graph, which do not follow real-world rules and patterns, and can be corrected using commonsense knowledge. We propose the first method to acquire visual commonsense, such as affordance and intuitive physics, automatically from data, and use it to improve the robustness of scene understanding. To this end, we extend Transformer models to incorporate the structure of scene graphs, and train our Global-Local Attention Transformer on a scene graph corpus. Once trained, our model can be applied to any scene graph generation model and correct its obvious mistakes, resulting in more semantically plausible scene graphs. Through extensive experiments, we show that our model learns commonsense better than any alternative, and improves the accuracy of state-of-the-art scene graph generation methods.
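The abstract describes extending a Transformer to operate on scene-graph structure and using the trained model to correct implausible predicates produced by a perception model. As a rough illustration of that idea only (not the paper's released architecture or code), the sketch below encodes each (subject, predicate, object) triple as a short token sequence, scores predicate plausibility with a small Transformer encoder, and overrides a detected predicate when a much more plausible one exists. The class and function names, the triple-as-tokens encoding, and the margin heuristic are all illustrative assumptions.

```python
# Hypothetical sketch of a commonsense-based scene graph corrector (PyTorch).
# Names and design choices are assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class GraphCommonsenseScorer(nn.Module):
    def __init__(self, num_objects, num_predicates, dim=128, heads=4, layers=2):
        super().__init__()
        self.obj_emb = nn.Embedding(num_objects, dim)
        self.pred_emb = nn.Embedding(num_predicates, dim)
        self.role_emb = nn.Embedding(3, dim)  # roles: subject / predicate / object
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.pred_head = nn.Linear(dim, num_predicates)

    def forward(self, subj_ids, pred_ids, obj_ids):
        # Each triple becomes a 3-token sequence; self-attention mixes the roles.
        tokens = torch.stack(
            [self.obj_emb(subj_ids), self.pred_emb(pred_ids), self.obj_emb(obj_ids)],
            dim=1,
        )
        roles = self.role_emb(torch.arange(3, device=tokens.device))
        hidden = self.encoder(tokens + roles)
        # Read the plausibility scores for all predicates from the predicate slot.
        return self.pred_head(hidden[:, 1])

def correct_scene_graph(model, subj_ids, pred_ids, obj_ids, margin=2.0):
    """Replace a detected predicate when the commonsense model finds it much
    less plausible than its own top choice (illustrative heuristic only)."""
    with torch.no_grad():
        logits = model(subj_ids, pred_ids, obj_ids)
        best = logits.argmax(dim=-1)
        detected_score = logits.gather(1, pred_ids.unsqueeze(1)).squeeze(1)
        best_score = logits.gather(1, best.unsqueeze(1)).squeeze(1)
        implausible = best_score - detected_score > margin
    return torch.where(implausible, best, pred_ids)
```

Under these assumptions, the scorer would be trained on triples from a scene graph corpus and then run as a post-processing step over the output of any scene graph generation model, leaving plausible predictions untouched.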