Paper Title
Improving Cross-Modal Understanding in Visual Dialog via Contrastive Learning
Paper Authors
Paper Abstract
Visual Dialog is a challenging vision-language task, since the visual dialog agent must answer a series of questions by reasoning over both the image content and the dialog history. Although existing methods attempt to address cross-modal understanding in visual dialog, they still fall short when ranking candidate answers based on their understanding of the visual and textual contexts. In this paper, we analyze cross-modal understanding in visual dialog based on the vision-language pre-training model VD-BERT and propose a novel approach, named ICMU, to improve cross-modal understanding for visual dialog. ICMU enhances cross-modal understanding by distinguishing different pulled inputs (i.e., pulled images, questions, or answers) based on four-way contrastive learning. In addition, ICMU exploits single-turn visual question answering to strengthen the visual dialog model's cross-modal understanding so that it can handle multi-turn visually-grounded conversations. Experiments show that the proposed approach improves the visual dialog model's cross-modal understanding and achieves satisfactory gains on the VisDial dataset.
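To make the four-way contrastive objective mentioned in the abstract concrete, the following is a minimal, PyTorch-style sketch. It assumes each (image, dialog history, question, answer) input is encoded into a single fused [CLS] vector by a VD-BERT-like cross-modal encoder, and that the model learns to identify which component, if any, was "pulled" (replaced with one sampled from a different example). The label ordering, class count usage, hidden size, and the stand-in encoder output are illustrative assumptions, not the paper's exact formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FourWayContrastiveHead(nn.Module):
    """Illustrative 4-way head for contrastive-style training (assumed design):
    given the fused [CLS] representation of an (image, dialog, question, answer)
    input, predict which component, if any, was pulled from another example.
    Assumed label scheme: 0 = original, 1 = pulled image,
    2 = pulled question, 3 = pulled answer.
    """
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, 4)  # four-way output

    def forward(self, cls_embedding: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        logits = self.classifier(cls_embedding)      # (batch, 4)
        return F.cross_entropy(logits, labels)       # four-way discrimination loss

# Usage sketch: `cls_emb` stands in for the output of a VD-BERT-style
# cross-modal encoder (hypothetical; not defined here).
batch, hidden = 8, 768
cls_emb = torch.randn(batch, hidden)                 # stand-in for encoder output
labels = torch.randint(0, 4, (batch,))               # which component was pulled
loss = FourWayContrastiveHead(hidden)(cls_emb, labels)

In this reading, the "pulled" variants act as hard negatives: the model can only tell whether the image, question, or answer was swapped by grounding each modality against the others, which is the cross-modal understanding the abstract targets.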