Paper Title
SceneComposer: Any-Level Semantic Image Synthesis
Paper Authors
Paper Abstract
We propose a new framework for conditional image synthesis from semantic layouts at any level of precision, ranging from pure text to a 2D semantic canvas with precise shapes. More specifically, the input layout consists of one or more semantic regions with free-form text descriptions and adjustable precision levels, which can be set based on the desired controllability. The framework naturally reduces to text-to-image (T2I) at the lowest level, with no shape information, and becomes segmentation-to-image (S2I) at the highest level. By supporting the levels in between, our framework can flexibly assist users of different drawing expertise and at different stages of their creative workflow. We introduce several novel techniques to address the challenges that come with this new setup, including a pipeline for collecting training data; a precision-encoded mask pyramid and a text feature map representation that jointly encode precision level, semantics, and composition information; and a multi-scale guided diffusion model to synthesize images. To evaluate the proposed method, we collect a test dataset containing user-drawn layouts with diverse scenes and styles. Experimental results show that the proposed method can generate high-quality images following the layout at the given precision, and compares favorably against existing methods. Project page: \url{https://zengxianyu.github.io/scenec/}
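To make the input format concrete, here is a minimal sketch of how a layout of semantic regions with per-region precision levels and a coarse-to-fine mask pyramid could be represented. This is an illustrative assumption, not the paper's actual implementation: the `Region` class, the `mask_pyramid` function, and the use of block max-pooling to coarsen masks are all hypothetical choices made for this example.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Region:
    text: str          # free-form description, e.g. "a red barn"
    mask: np.ndarray   # H x W boolean mask (H, W divisible by 2**(num_levels-1))
    precision: int     # 0 = text only (no shape), higher = more precise shape

def mask_pyramid(regions, num_levels=4):
    """Build a coarse-to-fine pyramid: a region contributes its (max-pooled)
    mask only up to the level matching its precision, so low-precision
    regions behave like text prompts and high-precision ones like
    segmentation masks."""
    pyramid = []
    for level in range(num_levels):
        factor = 2 ** (num_levels - 1 - level)  # downsampling factor at this level
        level_masks = []
        for r in regions:
            if r.precision >= level:  # region still carries shape info here
                h, w = r.mask.shape
                m = (r.mask
                     .reshape(h // factor, factor, w // factor, factor)
                     .max(axis=(1, 3)))  # block max-pooling coarsens the shape
                level_masks.append((r.text, m))
        pyramid.append(level_masks)
    return pyramid
```

Under this sketch, a precision-0 region only appears at the coarsest level (analogous to T2I), while a maximum-precision region keeps its exact mask at the finest level (analogous to S2I).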