Title
When and why vision-language models behave like bags-of-words, and what to do about it?
Authors
Abstract
Despite the success of large vision and language models (VLMs) in many downstream applications, it is unclear how well they encode compositional information. Here, we create the Attribution, Relation, and Order (ARO) benchmark to systematically evaluate the ability of VLMs to understand different types of relationships, attributes, and order. ARO consists of Visual Genome Attribution, to test the understanding of objects' properties; Visual Genome Relation, to test for relational understanding; and COCO & Flickr30k-Order, to test for order sensitivity. ARO is orders of magnitude larger than previous benchmarks of compositionality, with more than 50,000 test cases. We show where state-of-the-art VLMs have poor relational understanding, can blunder when linking objects to their attributes, and demonstrate a severe lack of order sensitivity. VLMs are predominantly trained and evaluated on large datasets with rich compositional structure in the images and captions. Yet, training on these datasets has not been enough to address the lack of compositional understanding, and evaluating on these datasets has failed to surface this deficiency. To understand why these limitations emerge and are not represented in the standard tests, we zoom into the evaluation and training procedures. We demonstrate that it is possible to perform well on retrieval over existing datasets without using the composition and order information. Given that contrastive pretraining optimizes for retrieval on datasets with similar shortcuts, we hypothesize that this can explain why the models do not need to learn to represent compositional information. This finding suggests a natural solution: composition-aware hard negative mining. We show that a simple-to-implement modification of contrastive learning significantly improves the performance on tasks requiring understanding of order and compositionality.
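The "composition-aware hard negative mining" the abstract proposes can be illustrated with a minimal sketch: for each caption, generate a perturbed version that keeps the same bag of words but scrambles their order, then add it as an extra negative in an InfoNCE-style contrastive loss. This is an illustration only; the function names, the specific perturbation (whole-caption shuffling), and the pure-Python loss are assumptions for clarity, not the paper's exact implementation.

```python
import math
import random


def order_perturbed_negative(caption, rng=None):
    """Make a hard negative caption with the same words in a different order.

    A model that treats text as a bag of words cannot distinguish this
    negative from the original, so including it in training forces the
    model to use order information. (Illustrative perturbation; the
    paper's negatives may be constructed differently.)
    """
    rng = rng or random.Random(0)
    words = caption.split()
    if len(set(words)) < 2:
        return caption  # no distinct reordering exists
    shuffled = words[:]
    while shuffled == words:
        rng.shuffle(shuffled)
    return " ".join(shuffled)


def contrastive_loss_with_negatives(sim_pos, sim_negs, temperature=0.07):
    """InfoNCE loss for one image: its true caption vs. hard negatives.

    sim_pos / sim_negs are cosine similarities between the image embedding
    and the positive / negative caption embeddings. Computed with a
    max-shift for numerical stability.
    """
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[0]  # -log softmax of the positive
```

A hard negative such as `order_perturbed_negative("the horse eats the grass")` is much closer to the positive caption than a random caption from the batch, so it contributes a larger loss until the model learns order-sensitive representations.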