Paper Title
Language Models Can See: Plugging Visual Controls in Text Generation
Paper Authors
Paper Abstract
Generative language models (LMs) such as GPT-2/3 can be prompted to generate text with remarkable quality. While they are designed for text-prompted generation, it remains an open question how the generation process could be guided by modalities beyond text such as images. In this work, we propose a training-free framework, called MAGIC (iMAge-Guided text generatIon with CLIP), for plugging in visual controls in the generation process and enabling LMs to perform multimodal tasks (e.g., image captioning) in a zero-shot manner. MAGIC is a simple yet efficient plug-and-play framework, which directly combines an off-the-shelf LM (i.e., GPT-2) and an image-text matching model (i.e., CLIP) for image-grounded text generation. During decoding, MAGIC influences the generation of the LM by introducing a CLIP-induced score, called magic score, which regularizes the generated result to be semantically related to a given image while being coherent to the previously generated context. Notably, the proposed decoding scheme does not involve any gradient update operation, therefore being computationally efficient. On the challenging task of zero-shot image captioning, MAGIC outperforms the state-of-the-art method by notable margins with a nearly 27 times decoding speedup. MAGIC is a flexible framework and is theoretically compatible with any text generation tasks that incorporate image grounding. In the experiments, we showcase that it is also capable of performing visually grounded story generation given both an image and a text prompt.
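To make the decoding scheme concrete, the sketch below illustrates the core idea with off-the-shelf GPT-2 and CLIP from Hugging Face: at each step, the LM's top-k next-token candidates are re-ranked by how well the extended text matches the given image under CLIP. This is a simplified assumption-laden illustration, not the authors' reference implementation; the hyperparameters (k, beta) and the scoring mix are made up for the example, and the paper's full magic score additionally accounts for coherence with the previously generated context.

```python
# Minimal sketch (not the authors' implementation): CLIP-guided re-ranking of
# GPT-2's top-k next-token candidates. The values of k and beta and the exact
# scoring mix are assumptions for illustration only.
import torch
from PIL import Image
from transformers import GPT2LMHeadModel, GPT2Tokenizer, CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
lm = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()
lm_tok = GPT2Tokenizer.from_pretrained("gpt2")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def guided_step(prefix_ids, image, k=45, beta=2.0):
    """Pick the next token by mixing LM confidence with CLIP image relevance."""
    next_logits = lm(prefix_ids).logits[0, -1]
    lm_probs, cand_ids = torch.softmax(next_logits, dim=-1).topk(k)

    # Score each candidate continuation against the image with CLIP.
    cand_texts = [lm_tok.decode(torch.cat([prefix_ids[0], t.view(1)]))
                  for t in cand_ids]
    clip_in = clip_proc(text=cand_texts, images=image, return_tensors="pt",
                        padding=True, truncation=True).to(device)
    image_match = torch.softmax(clip(**clip_in).logits_per_image[0], dim=-1)

    best = cand_ids[(lm_probs + beta * image_match).argmax()]
    return torch.cat([prefix_ids, best.view(1, 1)], dim=-1)

@torch.no_grad()
def caption(image_path, prompt="<|endoftext|>", max_new_tokens=16):
    """Gradient-free decoding loop driven by guided_step."""
    image = Image.open(image_path)
    ids = lm_tok(prompt, return_tensors="pt").input_ids.to(device)
    for _ in range(max_new_tokens):
        ids = guided_step(ids, image)
    return lm_tok.decode(ids[0], skip_special_tokens=True)
```

Because the visual control only re-ranks candidates that the LM already proposes, each step costs one GPT-2 forward pass plus one CLIP forward pass over k candidates and involves no gradient computation, which is consistent with the abstract's claim of computational efficiency relative to gradient-based zero-shot approaches.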