Paper Title
Prompt-based Learning for Unpaired Image Captioning
Paper Authors
Paper Abstract
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs. Existing works usually tackle this task with adversarial learning and a visual concept reward based on reinforcement learning. However, these approaches can learn only limited cross-domain information from the vision and language domains, which restrains the captioning performance of UIC. Inspired by the success of Vision-Language Pre-Trained Models (VL-PTMs), we attempt to infer cross-domain cue information about a given image from large VL-PTMs for the UIC task. This work is also motivated by the recent success of prompt learning in many downstream multi-modal tasks, including image-text retrieval and visual question answering. Specifically, a semantic prompt is introduced and aggregated with visual features for more accurate caption prediction under the adversarial learning framework. In addition, a metric prompt is designed to select high-quality pseudo image-caption pairs produced by the basic captioning model and to refine the model in an iterative manner. Extensive experiments on the COCO and Flickr30K datasets validate the promising captioning ability of the proposed model. We expect that the proposed prompt-based UIC model will stimulate a new line of research on VL-PTM-based captioning.
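
To make the metric-prompt step concrete, below is a minimal sketch of selecting high-quality pseudo image-caption pairs, assuming the metric is realized as CLIP image-text similarity from a public VL-PTM. The model choice, the function name select_pseudo_pairs, and the score threshold are illustrative assumptions, not the authors' exact implementation.

import torch
from transformers import CLIPModel, CLIPProcessor

# A public VL-PTM used here as an assumed stand-in for the metric prompt.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def select_pseudo_pairs(images, captions, threshold=0.3):
    """Keep (image, caption) pairs whose CLIP similarity exceeds `threshold`.

    `images` is a list of PIL images and `captions` a list of pseudo captions
    produced by the basic captioning model; the threshold value is an
    illustrative assumption.
    """
    inputs = processor(text=captions, images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Cosine similarity between each image and its own pseudo caption.
    image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    scores = (image_emb * text_emb).sum(dim=-1)
    return [(img, cap, s.item())
            for img, cap, s in zip(images, captions, scores)
            if s > threshold]

The surviving pairs would then serve as pseudo ground truth for the next round of training, matching the iterative refinement loop described in the abstract.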