Paper Title
Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis
Paper Authors
Paper Abstract
Large-scale diffusion models have achieved state-of-the-art results on text-to-image synthesis (T2I) tasks. Despite their ability to generate high-quality and creative images, we observe that attribute binding and compositional capabilities remain major challenges, especially when multiple objects are involved. In this work, we improve the compositional skills of T2I models, specifically targeting more accurate attribute binding and better image compositions. To do this, we incorporate linguistic structures into the diffusion guidance process, exploiting the controllable properties of cross-attention-layer manipulation in diffusion-based T2I models. We observe that keys and values in cross-attention layers carry strong semantic meanings associated with object layouts and content. Therefore, we can better preserve the compositional semantics of the generated image by manipulating the cross-attention representations based on linguistic insights. Built upon Stable Diffusion, a SOTA T2I model, our structured cross-attention design is efficient and requires no additional training samples. We achieve better compositional skills in qualitative and quantitative results, leading to a 5-8% advantage in head-to-head user comparison studies. Lastly, we conduct an in-depth analysis to reveal potential causes of incorrect image compositions and to justify the properties of cross-attention layers in the generation process.
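To make the abstract's central mechanism concrete, the following is a minimal sketch of one way structured cross-attention guidance could work: the attention map is computed once from the full prompt, while values are read out separately from each parsed noun phrase and the readouts are averaged. This is an illustrative approximation, not the authors' exact implementation; the function name, tensor shapes, the averaging rule, and the random tensors standing in for real CLIP text encodings are all assumptions.

    import torch

    def structured_cross_attention(latents, full_prompt_emb, phrase_embs,
                                   to_q, to_k, to_v):
        # latents:         (B, N, C) image tokens (queries)
        # full_prompt_emb: (B, T, D) text encoding of the full prompt
        # phrase_embs:     list of (B, T, D) encodings, one per noun phrase
        q = to_q(latents)             # queries from image features
        k = to_k(full_prompt_emb)     # keys from the full prompt only
        attn = torch.softmax(
            q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)

        # Reuse the same attention map, but take values from the full prompt
        # and from each phrase-level encoding, then average the readouts so
        # each attribute stays bound to its own noun phrase.
        outs = [attn @ to_v(e) for e in [full_prompt_emb, *phrase_embs]]
        return torch.stack(outs).mean(dim=0)

    # Toy usage with random stand-ins for CLIP text embeddings (hypothetical
    # shapes; a real Stable Diffusion cross-attention layer differs in detail).
    B, N, T, C, D = 1, 64, 77, 320, 768
    to_q = torch.nn.Linear(C, C, bias=False)
    to_k = torch.nn.Linear(D, C, bias=False)
    to_v = torch.nn.Linear(D, C, bias=False)
    latents = torch.randn(B, N, C)
    full    = torch.randn(B, T, D)    # e.g. "a red car and a white sheep"
    phrases = [torch.randn(B, T, D),  # e.g. "a red car"
               torch.randn(B, T, D)]  # e.g. "a white sheep"
    out = structured_cross_attention(latents, full, phrases, to_q, to_k, to_v)
    print(out.shape)  # torch.Size([1, 64, 320])

Because the attention map itself is unchanged and only the value readouts are recombined, a sketch like this adds no trainable parameters, which is consistent with the abstract's claim that the design requires no additional training samples.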