Paper Title

StyleHEAT: One-Shot High-Resolution Editable Talking Face Generation via Pre-trained StyleGAN

Authors

Fei Yin, Yong Zhang, Xiaodong Cun, Mingdeng Cao, Yanbo Fan, Xuan Wang, Qingyan Bai, Baoyuan Wu, Jue Wang, Yujiu Yang

Abstract

One-shot talking face generation aims at synthesizing a high-quality talking face video from an arbitrary portrait image, driven by a video or an audio segment. One challenging quality factor is the resolution of the output video: higher resolution conveys more details. In this work, we investigate the latent feature space of a pre-trained StyleGAN and discover some excellent spatial transformation properties. Based on this observation, we explore the possibility of using a pre-trained StyleGAN to break through the resolution limit of training datasets. We propose a novel unified framework based on a pre-trained StyleGAN that enables a set of powerful functionalities, i.e., high-resolution video generation, disentangled control by driving video or audio, and flexible face editing. Our framework elevates the resolution of the synthesized talking face to 1024×1024 for the first time, even though the training dataset has a lower resolution. We design a video-based motion generation module and an audio-based one, which can be plugged into the framework either individually or jointly to drive the video generation. The predicted motion is used to transform the latent features of StyleGAN for visual animation. To compensate for the transformation distortion, we propose a calibration network as well as a domain loss to refine the features. Moreover, our framework allows two types of facial editing, i.e., global editing via GAN inversion and intuitive editing based on 3D morphable models. Comprehensive experiments show superior video quality, flexible controllability, and editability over state-of-the-art methods.
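To make the abstract's core mechanism concrete, below is a minimal PyTorch sketch of the general idea: a dense flow field predicted by a motion module warps an intermediate StyleGAN feature map, and a small residual network then calibrates the warped features. This is not the authors' implementation; the names (warp_features, Calibration) and all shapes are illustrative assumptions.

# Minimal sketch (illustrative, not the paper's code): warp a StyleGAN
# feature map with a predicted flow field, then refine it with a small
# calibration network to compensate for transformation distortion.
import torch
import torch.nn as nn
import torch.nn.functional as F

def warp_features(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp a (B, C, H, W) feature map by a (B, 2, H, W) flow field
    (given in normalized [-1, 1] units) using bilinear sampling."""
    B, _, H, W = feat.shape
    # Build a normalized identity sampling grid.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H, device=feat.device),
        torch.linspace(-1, 1, W, device=feat.device),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(B, -1, -1, -1)
    # Offset the grid by the predicted flow and resample the features.
    grid = grid + flow.permute(0, 2, 3, 1)
    return F.grid_sample(feat, grid, align_corners=True)

class Calibration(nn.Module):
    """Toy stand-in for the calibration network: a residual correction
    applied to the warped features."""
    def __init__(self, channels: int):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, warped: torch.Tensor) -> torch.Tensor:
        return warped + self.refine(warped)

# Usage: animate a hypothetical 64x64 StyleGAN feature map with a dummy flow.
feat = torch.randn(1, 512, 64, 64)   # intermediate StyleGAN features
flow = torch.zeros(1, 2, 64, 64)     # flow from a video- or audio-based motion module
out = Calibration(512)(warp_features(feat, flow))
print(out.shape)                     # torch.Size([1, 512, 64, 64])

In the paper's pipeline the calibrated features are fed back into the StyleGAN synthesis path, which is what lets a generator pre-trained at 1024×1024 render the animation at that resolution even when the driving data is lower resolution.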
