Paper Title

Loss Re-scaling VQA: Revisiting the Language Prior Problem from a Class-imbalance View

Authors

Yangyang Guo, Liqiang Nie, Zhiyong Cheng, Qi Tian, Min Zhang

Abstract

Recent studies have pointed out that many well-developed Visual Question Answering (VQA) models are heavily affected by the language prior problem, which refers to making predictions based on co-occurrence patterns between textual questions and answers instead of reasoning over the visual content. To tackle it, most existing methods focus on enhancing visual feature learning to reduce the influence of this superficial textual shortcut on VQA model decisions. However, limited effort has been devoted to providing an explicit interpretation of its inherent cause. The research community therefore lacks clear guidance for moving forward in a purposeful way, which leads to confusion in model design when attempting to overcome this non-trivial problem. In this paper, we propose to interpret the language prior problem in VQA from a class-imbalance view. Concretely, we design a novel interpretation scheme in which the losses of mis-predicted frequent and sparse answers of the same question type are distinctly exhibited during the late training phase. It explicitly reveals why a VQA model tends to produce a frequent yet obviously wrong answer to a question whose correct answer is sparse in the training set. Based on this observation, we further develop a novel loss re-scaling approach that assigns a different weight to each answer according to the training data statistics when computing the final loss. We apply our approach to three baselines, and the experimental results on two VQA-CP benchmark datasets clearly demonstrate its effectiveness. In addition, we also validate the class-imbalance interpretation scheme on other computer vision tasks, such as face recognition and image classification.
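The abstract does not spell out the exact weighting formula, so the following is only a minimal sketch of the general idea it describes: re-scaling a cross-entropy loss with per-answer weights derived from training-set answer frequencies, so that sparse answers contribute more to the final loss. The inverse-frequency weighting and all names (`answer_weights`, `rescaled_loss`, `smooth`) are illustrative assumptions, not the authors' actual formulation.

```python
import torch
import torch.nn.functional as F


def answer_weights(answer_counts, smooth=1.0):
    """Hypothetical inverse-frequency weights from training-set answer counts.

    answer_counts: 1-D tensor where answer_counts[i] is the number of training
    samples whose ground-truth answer is class i (the training data statistics).
    """
    counts = answer_counts.float() + smooth          # smoothing avoids division by zero
    weights = counts.sum() / (len(counts) * counts)  # rare answers get larger weights
    return weights


def rescaled_loss(logits, targets, weights):
    """Cross-entropy in which each answer class contributes according to its weight."""
    return F.cross_entropy(logits, targets, weight=weights)


# Toy usage: 5 answer classes; the first is very frequent, the last is sparse.
counts = torch.tensor([9000, 500, 300, 150, 50])
w = answer_weights(counts)
logits = torch.randn(8, 5)            # batch of 8 VQA predictions over 5 answers
targets = torch.randint(0, 5, (8,))   # ground-truth answer indices
loss = rescaled_loss(logits, targets, w)
```

With this kind of weighting, a frequent answer that is confidently but wrongly predicted no longer dominates the loss, which matches the class-imbalance reading of the language prior problem given above.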
