Paper Title

Bridging CLIP and StyleGAN through Latent Alignment for Image Editing

Paper Authors

Wanfeng Zheng, Qiang Li, Xiaoyan Guo, Pengfei Wan, Zhongyuan Wang

Paper Abstract

Text-driven image manipulation has developed rapidly since the vision-language model CLIP was proposed. Previous work adopts CLIP to design text-image consistency-based objectives for this task. However, these methods require either test-time optimization or image feature cluster analysis to obtain a single-mode manipulation direction. In this paper, we achieve diverse manipulation direction mining without inference-time optimization by bridging CLIP and StyleGAN through Latent Alignment (CSLA). More specifically, our efforts consist of three parts: 1) a data-free training strategy to train latent mappers that bridge the latent spaces of CLIP and StyleGAN; 2) for more precise mapping, temporal relative consistency is proposed to address the knowledge distribution bias problem among different latent spaces; 3) to refine the mapped latent code in S space, adaptive style mixing is also proposed. With this mapping scheme, we can achieve GAN inversion, text-to-image generation, and text-driven image manipulation. Qualitative and quantitative comparisons demonstrate the effectiveness of our method.
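
To make the core idea concrete, below is a minimal, hypothetical sketch (not the authors' released code) of a latent mapper that bridges CLIP's embedding space and a StyleGAN latent space, trained "data-free" by sampling StyleGAN latents rather than using a real image dataset. The module names and the simple MSE objective are assumptions for illustration; the paper's full method additionally uses temporal relative consistency and adaptive style mixing in S space, which are omitted here.

```python
# Hypothetical sketch: a small MLP mapping CLIP image embeddings into a
# StyleGAN latent space, trained without real data by sampling StyleGAN latents.
import torch
import torch.nn as nn


class LatentMapper(nn.Module):
    """MLP bridging CLIP's embedding space (e.g. 512-d) to StyleGAN's w space."""

    def __init__(self, clip_dim: int = 512, w_dim: int = 512, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(clip_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, w_dim),
        )

    def forward(self, clip_embedding: torch.Tensor) -> torch.Tensor:
        return self.net(clip_embedding)


def data_free_step(mapper, optimizer, g_mapping, g_synthesis, clip_image_encoder,
                   batch_size: int = 8, z_dim: int = 512, device: str = "cuda"):
    """One data-free training step: sample w, render an image, encode it with
    CLIP, and train the mapper to recover w from the CLIP embedding.

    `g_mapping` (z -> w), `g_synthesis` (w -> image), and `clip_image_encoder`
    (image -> embedding) are assumed to be pretrained, frozen callables.
    """
    z = torch.randn(batch_size, z_dim, device=device)
    with torch.no_grad():
        w = g_mapping(z)                      # ground-truth StyleGAN latent
        img = g_synthesis(w)                  # synthetic image, so no dataset is needed
        e = clip_image_encoder(img)           # CLIP image embedding of the rendering
        e = e / e.norm(dim=-1, keepdim=True)  # CLIP embeddings are usually L2-normalized
    w_pred = mapper(e)
    loss = nn.functional.mse_loss(w_pred, w)  # simple reconstruction objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Once such a mapper is trained, a CLIP text embedding can (under the same assumptions) be pushed through it to obtain a StyleGAN latent, which is what enables text-to-image generation and text-driven manipulation without per-image test-time optimization.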
