Paper Title

Position-guided Text Prompt for Vision-Language Pre-training

Paper Authors

Alex Jinpeng Wang, Pan Zhou, Mike Zheng Shou, Shuicheng Yan

Paper Abstract

Vision-Language Pre-Training (VLP) has shown promising capabilities to align image and text pairs, facilitating a broad variety of cross-modal learning tasks. However, we observe that VLP models often lack the visual grounding/localization capability, which is critical for many downstream tasks such as visual reasoning. In this work, we propose a novel Position-guided Text Prompt (PTP) paradigm to enhance the visual grounding ability of cross-modal models trained with VLP. Specifically, in the VLP phase, PTP divides the image into $N\times N$ blocks and identifies the objects in each block through the object detectors widely used in VLP. It then reformulates the visual grounding task into a fill-in-the-blank problem given a PTP, by encouraging the model to predict the objects in a given block or regress the block of a given object, e.g. filling ``P'' or ``O'' in the PTP ``The block P has a O''. This mechanism improves the visual grounding capability of VLP models and thus helps them better handle various downstream tasks. By introducing PTP into several state-of-the-art VLP frameworks, we observe consistently significant improvements across representative cross-modal learning architectures and several benchmarks, e.g. zero-shot Flickr30K Retrieval (+4.8 in average recall@1) for the ViLT \cite{vilt} baseline, and COCO Captioning (+5.3 in CIDEr) for the SOTA BLIP \cite{blip} baseline. Moreover, PTP achieves comparable results to object-detector-based methods while offering much faster inference, since PTP discards its object detector at inference time while the latter cannot. Our code and pre-trained weights will be released at \url{https://github.com/sail-sg/ptp}.
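
To make the PTP idea concrete, below is a minimal sketch (not the authors' released code) of how a position-guided prompt could be built from object-detector output: each detected object is assigned to one of the $N\times N$ blocks by its box center, and the template "The block P has a O" is filled in. The helper names (`Box`, `make_ptp_prompts`) and the block-indexing scheme are illustrative assumptions; see the official repository at https://github.com/sail-sg/ptp for the actual implementation.

```python
# Hypothetical sketch of PTP prompt construction from detector output.
from dataclasses import dataclass
from typing import List

@dataclass
class Box:
    label: str   # object class predicted by the detector, e.g. "dog"
    cx: float    # box center x, normalized to [0, 1]
    cy: float    # box center y, normalized to [0, 1]

def make_ptp_prompts(boxes: List[Box], n: int = 3) -> List[str]:
    """Assign each detected object to one of the n x n image blocks and
    fill the template "The block P has a O"."""
    prompts = []
    for box in boxes:
        col = min(int(box.cx * n), n - 1)   # grid column containing the box center
        row = min(int(box.cy * n), n - 1)   # grid row containing the box center
        block_id = row * n + col            # flatten (row, col) into a single block index
        prompts.append(f"The block {block_id} has a {box.label}.")
    return prompts

# Example: two detections on a 3 x 3 grid.
print(make_ptp_prompts([Box("dog", 0.15, 0.20), Box("ball", 0.80, 0.85)]))
# -> ['The block 0 has a dog.', 'The block 8 has a ball.']
```

During pre-training, the model is then asked either to predict the object word given the block index, or to regress the block index given the object word, which is what injects position awareness without requiring a detector at inference time.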
