Paper Title
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
Paper Authors
Paper Abstract
We present Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. Imagen builds on the power of large transformer language models in understanding text and hinges on the strength of diffusion models in high-fidelity image generation. Our key discovery is that generic large language models (e.g. T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model. Imagen achieves a new state-of-the-art FID score of 7.27 on the COCO dataset, without ever training on COCO, and human raters find Imagen samples to be on par with the COCO data itself in image-text alignment. To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models. With DrawBench, we compare Imagen with recent methods including VQ-GAN+CLIP, Latent Diffusion Models, and DALL-E 2, and find that human raters prefer Imagen over other models in side-by-side comparisons, both in terms of sample quality and image-text alignment. See https://imagen.research.google/ for an overview of the results.