Paper Title
Retrieval-Augmented Multimodal Language Modeling
Paper Authors
Paper Abstract
Recent multimodal models such as DALL-E and CM3 have achieved remarkable progress in text-to-image and image-to-text generation. However, these models store all learned knowledge (e.g., the appearance of the Eiffel Tower) in the model parameters, requiring increasingly larger models and training data to capture more knowledge. To integrate knowledge in a more scalable and modular way, we propose a retrieval-augmented multimodal model, which enables a base multimodal model (generator) to refer to relevant text and images fetched by a retriever from external memory (e.g., documents on the web). Specifically, for the retriever, we use a pretrained CLIP, and for the generator, we train a CM3 Transformer on the LAION dataset. Our resulting model, named Retrieval-Augmented CM3 (RA-CM3), is the first multimodal model that can retrieve and generate both text and images. We show that RA-CM3 significantly outperforms baseline multimodal models such as DALL-E and CM3 on both image and caption generation tasks (12 FID and 17 CIDEr improvements on MS-COCO), while requiring much less compute for training (<30% of DALL-E). Moreover, we show that RA-CM3 exhibits novel capabilities, such as faithful image generation and multimodal in-context learning (e.g., image generation from demonstrations).
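Below is a minimal sketch of the retrieve-then-generate flow the abstract describes: a CLIP-based retriever scores multimodal documents in an external memory against a query, and the top results are prepended to the generator's input. The Hugging Face CLIP classes are used as an assumed stand-in for the paper's retriever, and `cm3_generator` is a hypothetical placeholder for the CM3 Transformer, which is not publicly packaged here.

```python
# Sketch only: CLIP retriever + placeholder generator, not the authors' released code.
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_text(texts):
    """Encode text queries into the shared CLIP embedding space."""
    inputs = processor(text=texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        feats = clip.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def embed_images(images):
    """Encode memory images (PIL images) into the same embedding space."""
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = clip.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def retrieve(query_text, memory_embeddings, memory_docs, k=2):
    """Return the k documents whose embeddings are closest to the query (cosine similarity)."""
    q = embed_text([query_text])                 # (1, d), unit-normalized
    scores = q @ memory_embeddings.T             # memory_embeddings: (N, d), unit-normalized
    top = scores.squeeze(0).topk(k).indices.tolist()
    return [memory_docs[i] for i in top]

def generate_image(query_text, memory_embeddings, memory_docs, cm3_generator):
    """Prepend retrieved multimodal documents to the prompt and decode with the generator.

    `cm3_generator` is a hypothetical callable standing in for the CM3 Transformer
    trained on LAION; the real model interleaves text and image tokens in one sequence.
    """
    retrieved = retrieve(query_text, memory_embeddings, memory_docs)
    return cm3_generator(context=retrieved, prompt=query_text)
```

In this sketch the external memory is just a pre-embedded in-memory matrix; a real deployment at web scale would swap in an approximate nearest-neighbor index, but the retriever/generator division of labor is the same.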