Paper Title
ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech
Paper Authors
Paper Abstract
Denoising Diffusion Probabilistic Models (DDPMs) are emerging in text-to-speech (TTS) synthesis because of their strong capability to generate high-fidelity samples. However, their iterative refinement process in a high-dimensional data space results in slow inference, which restricts their application in real-time systems. Previous works have explored acceleration by minimizing the number of inference steps, but at the cost of sample quality. In this work, to improve the inference speed of DDPM-based TTS models while achieving high sample quality, we propose ResGrad, a lightweight diffusion model that learns to refine the output spectrogram of an existing TTS model (e.g., FastSpeech 2) by predicting the residual between the model output and the corresponding ground-truth speech. ResGrad has several advantages: 1) Compared with other acceleration methods for DDPMs, which need to synthesize speech from scratch, ResGrad reduces the complexity of the task by changing the generation target from the ground-truth mel-spectrogram to the residual, resulting in a more lightweight model and thus a smaller real-time factor. 2) ResGrad is employed in the inference process of the existing TTS model in a plug-and-play way, without re-training this model. We verify ResGrad on the single-speaker dataset LJSpeech and on two more challenging datasets with multiple speakers (LibriTTS) and a high sampling rate (VCTK). Experimental results show that, in comparison with other speed-up methods for DDPMs: 1) ResGrad achieves better sample quality at the same inference speed, measured by real-time factor; 2) with similar speech quality, ResGrad synthesizes speech more than 10 times faster than baseline methods. Audio samples are available at https://resgrad1.github.io/.
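The residual formulation in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, the sign convention (ground truth minus model output), and the 80 x 120 mel-spectrogram shape are assumptions made for illustration, and an oracle residual stands in for the diffusion model's prediction. The real-time factor is computed with its usual definition, synthesis time divided by the duration of the generated audio.

```python
import numpy as np


def residual_target(gt_mel: np.ndarray, tts_mel: np.ndarray) -> np.ndarray:
    """Training target (assumed sign convention): what the base TTS output
    lacks relative to the ground-truth mel-spectrogram."""
    return gt_mel - tts_mel


def refine(tts_mel: np.ndarray, predicted_residual: np.ndarray) -> np.ndarray:
    """Inference-time refinement: add the predicted residual to the output
    of the unchanged base TTS model."""
    return tts_mel + predicted_residual


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: time spent synthesizing per second of audio."""
    return synthesis_seconds / audio_seconds


# Hypothetical data: 80 mel bins, 120 frames.
rng = np.random.default_rng(0)
gt_mel = rng.normal(size=(80, 120))
# Stand-in for an existing TTS model's output (ground truth plus an error).
tts_mel = gt_mel + 0.1 * rng.normal(size=gt_mel.shape)

target = residual_target(gt_mel, tts_mel)
# With a perfect (oracle) residual prediction, refinement recovers the target.
refined = refine(tts_mel, target)
```

In this toy setup the residual has a much smaller magnitude than the spectrogram itself, which is the intuition behind the abstract's claim that predicting the residual is an easier task than generating the mel-spectrogram from scratch.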