Paper Title
Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training
Paper Authors
Paper Abstract
Large-scale vision-language pre-trained (VLP) models are prone to hallucinate non-existent visual objects when generating text based on visual information. In this paper, we systematically study the object hallucination problem from three aspects. First, we examine recent state-of-the-art VLP models, showing that they still hallucinate frequently, and models achieving better scores on standard metrics (e.g., CIDEr) could be more unfaithful. Second, we investigate how different types of image encoding in VLP influence hallucination, including region-based, grid-based, and patch-based. Surprisingly, we find that patch-based features perform the best and smaller patch resolution yields a non-trivial reduction in object hallucination. Third, we decouple various VLP objectives and demonstrate that token-level image-text alignment and controlled generation are crucial to reducing hallucination. Based on that, we propose a simple yet effective VLP loss named ObjMLM to further mitigate object hallucination. Results show that it reduces object hallucination by up to 17.4% when tested on two benchmarks (COCO Caption for in-domain and NoCaps for out-of-domain evaluation).
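As a rough illustration of how object hallucination is typically quantified (a CHAIR-style rate, not necessarily the paper's exact protocol): the score is the fraction of objects mentioned in a generated caption that are absent from the image's annotated object set. A minimal sketch, assuming object lists have already been extracted from the caption and the annotations:

```python
def hallucination_rate(caption_objects, image_objects):
    """CHAIR-i-style metric: fraction of objects mentioned in a
    generated caption that do not appear in the image's
    ground-truth object annotations.

    caption_objects: objects detected in the generated caption
    image_objects: objects annotated as present in the image
    """
    mentioned = list(caption_objects)
    if not mentioned:
        return 0.0  # no objects mentioned, nothing can be hallucinated
    truth = set(image_objects)
    hallucinated = [obj for obj in mentioned if obj not in truth]
    return len(hallucinated) / len(mentioned)
```

For example, a caption mentioning "dog", "frisbee", and "car" for an image annotated only with "dog" and "frisbee" yields a rate of 1/3; lower is more faithful.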