Paper Title

Position-guided Text Prompt for Vision-Language Pre-training

Paper Authors

Alex Jinpeng Wang, Pan Zhou, Mike Zheng Shou, Shuicheng Yan

Paper Abstract

Vision-Language Pre-Training (VLP) has shown promising capabilities to align image and text pairs, facilitating a broad variety of cross-modal learning tasks. However, we observe that VLP models often lack the visual grounding/localization capability, which is critical for many downstream tasks such as visual reasoning. In this work, we propose a novel Position-guided Text Prompt (PTP) paradigm to enhance the visual grounding ability of cross-modal models trained with VLP. Specifically, in the VLP phase, PTP divides the image into $N\times N$ blocks and identifies the objects in each block through the object detectors widely used in VLP. It then reformulates the visual grounding task into a fill-in-the-blank problem given a PTP, by encouraging the model to predict the objects in a given block or regress the block of a given object, e.g. filling ``P'' or ``O'' in the PTP ``The block P has a O''. This mechanism improves the visual grounding capability of VLP models and thus helps them better handle various downstream tasks. By introducing PTP into several state-of-the-art VLP frameworks, we observe consistently significant improvements across representative cross-modal learning architectures and several benchmarks, e.g. zero-shot Flickr30K Retrieval (+4.8 in average recall@1) for the ViLT \cite{vilt} baseline, and COCO Captioning (+5.3 in CIDEr) for the SOTA BLIP \cite{blip} baseline. Moreover, PTP achieves comparable results to object-detector-based methods while offering much faster inference, since PTP discards its object detector at inference time while the latter cannot. Our code and pre-trained weights will be released at \url{https://github.com/sail-sg/ptp}.
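
To make the PTP idea concrete, below is a minimal sketch (not the authors' released code) of how a position-guided prompt could be built from object-detector output: each detected object is assigned to one of the $N\times N$ blocks by its box center, and the template "The block P has a O" is filled in. The helper names (`Box`, `make_ptp_prompts`) and the block-indexing scheme are illustrative assumptions; see the official repository at https://github.com/sail-sg/ptp for the actual implementation.

```python
# Hypothetical sketch of PTP prompt construction from detector output.
from dataclasses import dataclass
from typing import List

@dataclass
class Box:
    label: str   # object class predicted by the detector, e.g. "dog"
    cx: float    # box center x, normalized to [0, 1]
    cy: float    # box center y, normalized to [0, 1]

def make_ptp_prompts(boxes: List[Box], n: int = 3) -> List[str]:
    """Assign each detected object to one of the n x n image blocks and
    fill the template "The block P has a O"."""
    prompts = []
    for box in boxes:
        col = min(int(box.cx * n), n - 1)   # grid column containing the box center
        row = min(int(box.cy * n), n - 1)   # grid row containing the box center
        block_id = row * n + col            # flatten (row, col) into a single block index
        prompts.append(f"The block {block_id} has a {box.label}.")
    return prompts

# Example: two detections on a 3 x 3 grid.
print(make_ptp_prompts([Box("dog", 0.15, 0.20), Box("ball", 0.80, 0.85)]))
# -> ['The block 0 has a dog.', 'The block 8 has a ball.']
```

During pre-training, the model is then asked either to predict the object word given the block index, or to regress the block index given the object word, which is what injects position awareness without requiring a detector at inference time.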
