Frido: Feature Pyramid Diffusion for Complex Scene Image Synthesis

Paper Title

Frido: Feature Pyramid Diffusion for Complex Scene Image Synthesis

Authors

Fan, Wan-Cyuan; Chen, Yen-Chun; Chen, Dongdong; Cheng, Yu; Yuan, Lu; Wang, Yu-Chiang Frank

Abstract

Diffusion models (DMs) have shown great potential for high-quality image synthesis. However, when it comes to producing images with complex scenes, how to properly describe both image global structures and object details remains a challenging task. In this paper, we present Frido, a Feature Pyramid Diffusion model performing a multi-scale coarse-to-fine denoising process for image synthesis. Our model decomposes an input image into scale-dependent vector quantized features, followed by a coarse-to-fine gating for producing image output. During the above multi-scale representation learning stage, additional input conditions like text, scene graph, or image layout can be further exploited. Thus, Frido can also be applied to conditional or cross-modality image synthesis. We conduct extensive experiments over various unconditional and conditional image generation tasks, ranging from text-to-image synthesis, layout-to-image, scene-graph-to-image, to label-to-image. More specifically, we achieve state-of-the-art FID scores on five benchmarks, namely layout-to-image on COCO and OpenImages, scene-graph-to-image on COCO and Visual Genome, and label-to-image on COCO. Code is available at https://github.com/davidhalladay/Frido.
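To make the idea of "scale-dependent vector quantized features" more concrete, the following is a minimal toy sketch in NumPy. It is not the authors' implementation: the residual scheme, the average pooling, the random codebooks, and every function name (`quantize`, `avg_pool`, `upsample`, `pyramid_encode`, `pyramid_decode`) are assumptions chosen for illustration. It only shows how a feature map might be split into a coarse level that captures global structure and a fine level that captures what the coarse level missed; the diffusion (denoising) stage and the gating described in the abstract are not modeled.

```python
import numpy as np

def quantize(features, codebook):
    """Replace each feature vector with its nearest codebook entry.

    features: (H, W, C) array; codebook: (K, C) array.
    Returns (indices of shape (H, W), quantized features of shape (H, W, C)).
    """
    flat = features.reshape(-1, features.shape[-1])  # (H*W, C)
    # Squared Euclidean distance from every vector to every code.
    dists = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)
    quantized = codebook[idx].reshape(features.shape)
    return idx.reshape(features.shape[:2]), quantized

def avg_pool(features, factor):
    """Downsample an (H, W, C) map by averaging factor x factor blocks."""
    h, w, c = features.shape
    blocks = features.reshape(h // factor, factor, w // factor, factor, c)
    return blocks.mean(axis=(1, 3))

def upsample(features, factor):
    """Nearest-neighbour upsampling of an (H, W, C) map."""
    return features.repeat(factor, axis=0).repeat(factor, axis=1)

def pyramid_encode(features, codebooks, factors):
    """Hypothetical coarse-to-fine decomposition (coarsest level first).

    Each level quantizes the residual left over by the coarser levels.
    """
    residual = features.copy()
    levels = []
    for codebook, factor in zip(codebooks, factors):
        idx, q = quantize(avg_pool(residual, factor), codebook)
        levels.append(idx)
        residual = residual - upsample(q, factor)
    return levels

def pyramid_decode(levels, codebooks, factors):
    """Reconstruct by summing the upsampled code vectors of every level."""
    return sum(upsample(cb[idx], f)
               for idx, cb, f in zip(levels, codebooks, factors))

# Usage: an 8x8 map with 4 channels, one coarse (2x2) and one fine (8x8) level.
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 8, 4))
codebooks = [rng.normal(size=(16, 4)), rng.normal(size=(16, 4))]
levels = pyramid_encode(feats, codebooks, factors=(4, 1))
recon = pyramid_decode(levels, codebooks, factors=(4, 1))
```

In this sketch the index maps in `levels` play the role of the discrete multi-scale representation; a coarse-to-fine generative model would produce the coarse map first and then the finer one conditioned on it. In a real system the codebooks would be learned rather than drawn at random.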
