Paper Title


Text to Image Synthesis using Stacked Conditional Variational Autoencoders and Conditional Generative Adversarial Networks

Authors

Haileleol Tibebu, Aadil Malik, Varuna De Silva

Abstract


Synthesizing a realistic image from a textual description is a major challenge in computer vision. Current text-to-image synthesis approaches fall short of producing a high-resolution image that represents a text descriptor. Most existing studies rely either on Generative Adversarial Networks (GANs) or Variational Autoencoders (VAEs). GANs have the capability to produce sharper images but lack diversity in their outputs, whereas VAEs are good at producing a diverse range of outputs, but the images they generate are often blurred. Taking into account the relative advantages of both GANs and VAEs, we propose a new stacked Conditional VAE (CVAE) and Conditional GAN (CGAN) network architecture for synthesizing images conditioned on a text description. This study uses a Conditional VAE as an initial generator to produce a high-level sketch of the text descriptor. The high-level sketch output from the first stage, together with the text descriptor, is used as input to the Conditional GAN network. The second-stage GAN produces a 256×256 high-resolution image. The proposed architecture benefits from conditioning augmentation and residual blocks in the Conditional GAN network to achieve its results. Multiple experiments were conducted using the CUB and Oxford-102 datasets, and the results of the proposed approach were compared against state-of-the-art techniques such as StackGAN. The experiments illustrate that the proposed method generates high-resolution images conditioned on text descriptions and yields competitive results on both datasets based on the Inception Score and Fréchet Inception Distance.
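The abstract mentions conditioning augmentation, a technique (introduced in StackGAN-style models) for turning a fixed text embedding into a smooth, resampled latent conditioning code via the reparameterization trick. The sketch below illustrates that step only, in NumPy; the projection weights, embedding size, and 128-dimensional code size are illustrative assumptions, not values from the paper.

```python
import numpy as np

def conditioning_augmentation(text_embedding, w_mu, w_logvar, rng):
    """Map a text embedding to a sampled conditioning code.

    A linear projection predicts a mean and a log-variance of a diagonal
    Gaussian, and the code is drawn via the reparameterization trick:
    c = mu + sigma * eps, with eps ~ N(0, I). Sampling (rather than using
    the embedding directly) smooths the conditioning manifold.
    """
    mu = text_embedding @ w_mu            # predicted mean
    logvar = text_embedding @ w_logvar    # predicted log-variance
    eps = rng.standard_normal(mu.shape)   # noise sample eps ~ N(0, I)
    return mu + np.exp(0.5 * logvar) * eps

# Hypothetical sizes: a 1024-d text embedding compressed to a 128-d code.
rng = np.random.default_rng(0)
embed_dim, cond_dim = 1024, 128
w_mu = rng.standard_normal((embed_dim, cond_dim)) * 0.01
w_logvar = rng.standard_normal((embed_dim, cond_dim)) * 0.01
phi = rng.standard_normal(embed_dim)      # stand-in for a text embedding

c = conditioning_augmentation(phi, w_mu, w_logvar, rng)
print(c.shape)  # (128,)
```

In the full architecture this sampled code would be concatenated with the generator input at each stage; here it only demonstrates why repeated calls with the same embedding yield different but nearby conditioning vectors.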
