Paper Title
High-Fidelity Guided Image Synthesis with Latent Diffusion Models
Paper Authors
Paper Abstract
Controllable image synthesis with user scribbles has gained huge public interest with the recent advent of text-conditioned latent diffusion models. The user scribbles control the color composition while the text prompt provides control over the overall image semantics. However, we note that prior works in this direction suffer from an intrinsic domain shift problem, wherein the generated outputs often lack details and resemble simplistic representations of the target domain. In this paper, we propose a novel guided image synthesis framework, which addresses this problem by modeling the output image as the solution of a constrained optimization problem. We show that while computing an exact solution to the optimization is infeasible, an approximation of the same can be achieved while just requiring a single pass of the reverse diffusion process. Additionally, we show that by simply defining a cross-attention based correspondence between the input text tokens and the user stroke-painting, the user is also able to control the semantics of different painted regions without requiring any conditional training or finetuning. Human user study results show that the proposed approach outperforms the previous state-of-the-art by over 85.32% on the overall user satisfaction scores. Project page for our paper is available at https://1jsingh.github.io/gradop.
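To make the guided-synthesis idea in the abstract concrete, below is a minimal sketch of a single pass of reverse diffusion in which each denoising step is nudged by the gradient of a constraint tying the predicted output to the user's stroke painting. This is an illustration of the general technique only, not the authors' released code: `eps_model`, `z_paint`, `guide_weight`, and the linear noise schedule are all hypothetical stand-ins for a pretrained latent diffusion model and the paper's actual constrained optimization.

```python
# Minimal sketch (assumption): single-pass guided reverse diffusion, where
# each denoising step is corrected by the gradient of a color-composition
# constraint against the user's stroke painting in latent space.
import torch

torch.manual_seed(0)

T = 50                                   # number of reverse-diffusion steps
latent_shape = (1, 4, 64, 64)            # typical LDM latent size

# Stand-in for a pretrained LDM noise predictor eps_theta(z_t, t).
eps_model = lambda z, t: 0.1 * torch.randn_like(z)

# Encoded user stroke painting in latent space (random placeholder here).
z_paint = torch.randn(latent_shape)

# Simple linear alpha-bar schedule (placeholder for the real one).
alpha_bar = torch.linspace(0.999, 0.01, T)

z = torch.randn(latent_shape)            # start from pure noise
guide_weight = 0.2                       # strength of the painting constraint

for i in range(T):
    ab = alpha_bar[i]
    z = z.detach().requires_grad_(True)
    eps = eps_model(z, i)
    # Predicted clean latent z_0 under the current noise estimate.
    z0_hat = (z - (1 - ab).sqrt() * eps) / ab.sqrt()
    # Constraint: the predicted output should stay close to the painting.
    loss = ((z0_hat - z_paint) ** 2).mean()
    grad = torch.autograd.grad(loss, z)[0]
    with torch.no_grad():
        # DDIM-style update, nudged along the constraint gradient.
        ab_next = alpha_bar[i + 1] if i + 1 < T else torch.tensor(1e-4)
        z = ab_next.sqrt() * z0_hat + (1 - ab_next).sqrt() * eps
        z = z - guide_weight * grad

print("final latent stats:", z.mean().item(), z.std().item())
```

In the actual method, the constraint would come from the paper's constrained optimization formulation and the noise predictor would be a pretrained latent diffusion UNet; the sketch only shows where the single-pass approximation and the guidance gradient enter the sampling loop.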