Paper title
Exploring Transformer Backbones for Image Diffusion Models
Paper authors
Paper abstract
We present an end-to-end Transformer-based Latent Diffusion model for image synthesis. On the ImageNet class-conditioned generation task, we show that a Transformer-based Latent Diffusion model achieves an FID of 14.1, comparable to the 13.1 FID of a UNet-based architecture. In addition to demonstrating the application of Transformer models to Diffusion-based image synthesis, this architectural simplification allows easy fusion and modeling of text and image data. The multi-head attention mechanism of Transformers enables simplified interaction between image and text features, which removes the need for the cross-attention mechanism used in UNet-based Diffusion models.
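The joint-attention idea in the abstract can be illustrated with a minimal NumPy sketch: image and text tokens are concatenated into one sequence, and a single self-attention pass lets every image token attend to every text token (and vice versa), so no separate cross-attention module is needed. All shapes, seeds, and projection matrices below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, d):
    # Single-head self-attention with random projections (illustrative only).
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((x.shape[-1], d)) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))  # every token attends to every other
    return attn @ v

# Hypothetical sizes: 16 image latent tokens and 8 text tokens, 32-dim embeddings.
image_tokens = np.random.default_rng(1).standard_normal((16, 32))
text_tokens = np.random.default_rng(2).standard_normal((8, 32))

# Fuse the modalities by simple concatenation along the sequence axis;
# self-attention over the joint sequence mixes image and text features.
fused = np.concatenate([image_tokens, text_tokens], axis=0)
out = self_attention(fused, d=32)
print(out.shape)  # one output per token: (24, 32)
```

In a UNet-based diffusion model, the same interaction would typically require a dedicated cross-attention layer where image activations query text embeddings; here the concatenated sequence makes that interaction a by-product of ordinary self-attention.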