Paper Title
Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models
Paper Authors
Paper Abstract
Despite the impressive advancements achieved through vision-and-language pretraining, it remains unclear whether this joint learning paradigm can help understand each individual modality. In this work, we conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models by probing a broad range of tasks, aiming to assess the quality of the learned representations in a nuanced manner. Interestingly, our empirical observations suggest that vision-and-language models are better at label prediction tasks like object and attribute prediction, while vision-only models are stronger at dense prediction tasks that require more localized information. We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models. Code will be released at https://github.com/Lizw14/visual_probing
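To make the probing setup concrete, the following is a minimal sketch of how frozen visual representations can be evaluated with a linear probe on a label prediction task, assuming a backbone that returns pooled feature vectors. The class names, dimensions, and training step below are illustrative placeholders, not the paper's released code.

import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    # Linear classifier trained on top of frozen visual features (illustrative).
    def __init__(self, feature_dim: int, num_classes: int):
        super().__init__()
        self.head = nn.Linear(feature_dim, num_classes)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.head(features)

def probe_step(backbone: nn.Module, probe: LinearProbe,
               images: torch.Tensor, labels: torch.Tensor,
               optimizer: torch.optim.Optimizer) -> float:
    # One training step: the pretrained backbone stays frozen; only the probe is updated.
    backbone.eval()
    with torch.no_grad():
        features = backbone(images)  # assumed pooled [batch, feature_dim] embeddings
    logits = probe(features)
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

A dense prediction probe would follow the same pattern, but keep the backbone's spatial feature map and attach a per-pixel or per-region head instead of a single linear classifier.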