Title

Overcoming Language Priors with Self-supervised Learning for Visual Question Answering

Authors

Xi Zhu, Zhendong Mao, Chunxiao Liu, Peng Zhang, Bin Wang, Yongdong Zhang

Abstract

Most Visual Question Answering (VQA) models suffer from the language prior problem, which is caused by inherent data biases. Specifically, VQA models tend to answer questions (e.g., what color is the banana?) with high-frequency answers (e.g., yellow) while ignoring image content. Existing approaches tackle this problem by designing more sophisticated models or introducing additional visual annotations, so as to reduce question dependency while strengthening image dependency. However, they remain subject to the language prior problem, since the underlying data biases themselves are not alleviated. In this paper, we introduce a self-supervised learning framework to solve this problem. Concretely, we first automatically generate labeled data to balance the biased data, then propose a self-supervised auxiliary task that leverages the balanced data to help the base VQA model overcome language priors. Our method compensates for the data biases by generating balanced data without introducing external annotations. Experimental results show that our method significantly outperforms the state of the art, improving overall accuracy from 49.50% to 57.59% on the most commonly used benchmark, VQA-CP v2. In other words, we improve on annotation-based methods by 16% without using any external annotations.
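The abstract compresses the method into two steps: automatically re-pairing questions with other images to create balanced "irrelevant" training pairs, and an auxiliary task that penalizes confident answers on those pairs. Below is a minimal PyTorch sketch of that idea under stated assumptions: `base_vqa_model`, `make_irrelevant_pairs`, `joint_loss`, and the weight `alpha` are illustrative names, and the loss is one plausible instantiation rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_irrelevant_pairs(images: torch.Tensor) -> torch.Tensor:
    """Automatically generate 'irrelevant' question-image pairs by
    shuffling images within the batch, so each question is re-paired
    with an (almost surely) unrelated image -- no external annotation."""
    perm = torch.randperm(images.size(0), device=images.device)
    return images[perm]

def joint_loss(base_vqa_model: nn.Module,
               images: torch.Tensor,          # (B, ...) image features
               questions: torch.Tensor,       # (B, ...) question features
               answer_targets: torch.Tensor,  # (B, num_answers) soft labels
               alpha: float = 1.0) -> torch.Tensor:
    """Standard VQA loss on the original (relevant) pairs, plus a
    self-supervised term that suppresses the answer confidence the model
    assigns when the image is irrelevant. A model that leans on language
    priors stays confident even on shuffled images, so it gets penalized."""
    # Multi-label VQA objective on the original, relevant pairs.
    logits_rel = base_vqa_model(images, questions)
    vqa_loss = F.binary_cross_entropy_with_logits(logits_rel, answer_targets)

    # Confidence in the ground-truth answers under irrelevant images;
    # minimizing it forces the model to actually look at the image.
    logits_irr = base_vqa_model(make_irrelevant_pairs(images), questions)
    conf_irr = (torch.sigmoid(logits_irr) * answer_targets).sum(dim=1).mean()

    return vqa_loss + alpha * conf_irr
```

The design point that matches the abstract's central claim: shuffling within a batch produces the balanced data for free, with no human labeling, which is why the method needs no external annotations.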
