Paper Title

FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis

Paper Authors

Rongjie Huang, Max W. Y. Lam, Jun Wang, Dan Su, Dong Yu, Yi Ren, Zhou Zhao

Paper Abstract

Denoising diffusion probabilistic models (DDPMs) have recently achieved leading performance in many generative tasks. However, the cost of their inherited iterative sampling process has hindered their application to speech synthesis. This paper proposes FastDiff, a fast conditional diffusion model for high-quality speech synthesis. FastDiff employs a stack of time-aware location-variable convolutions with diverse receptive field patterns to efficiently model long-term time dependencies under adaptive conditions. A noise schedule predictor is also adopted to reduce the number of sampling steps without sacrificing generation quality. Based on FastDiff, we design an end-to-end text-to-speech synthesizer, FastDiff-TTS, which generates high-fidelity speech waveforms without any intermediate features (e.g., Mel-spectrograms). Our evaluation of FastDiff demonstrates state-of-the-art results with higher-quality (MOS 4.28) speech samples. FastDiff also enables a sampling speed 58x faster than real time on a V100 GPU, making diffusion models practically applicable to speech synthesis deployment for the first time. We further show that FastDiff generalizes well to mel-spectrogram inversion of unseen speakers, and that FastDiff-TTS outperforms other competing methods in end-to-end text-to-speech synthesis. Audio samples are available at \url{https://FastDiff.github.io/}.
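To make the first component concrete, below is a minimal PyTorch sketch of a time-aware location-variable convolution: a small predictor maps conditioning frames (e.g., a mel-spectrogram) together with a diffusion-step embedding to a separate convolution kernel for each frame-aligned waveform segment. All module names, shapes, and the predictor architecture here are assumptions for illustration, not the paper's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TimeAwareLVC(nn.Module):
    # Illustrative sketch only: a 1x1-conv "kernel predictor" turns each
    # conditioning frame (plus the diffusion-step embedding) into a local
    # convolution kernel, which is then applied to that frame's segment
    # of the waveform features.
    def __init__(self, channels=8, cond_channels=80, step_dim=64,
                 kernel_size=3, hop_size=256):
        super().__init__()
        self.channels = channels
        self.kernel_size = kernel_size
        self.hop_size = hop_size  # waveform samples per conditioning frame
        self.kernel_predictor = nn.Conv1d(
            cond_channels + step_dim,
            channels * channels * kernel_size, kernel_size=1)

    def forward(self, x, cond, step_emb):
        # x:        (B, C, T) noisy waveform features, T = hop_size * n_frames
        # cond:     (B, M, n_frames) conditioning frames (e.g., mel-spectrogram)
        # step_emb: (B, D) embedding of the diffusion step t
        B, C, T = x.shape
        n_frames = cond.size(-1)
        assert T == n_frames * self.hop_size
        # Broadcast the step embedding over frames, then predict local kernels.
        step = step_emb.unsqueeze(-1).expand(-1, -1, n_frames)
        kernels = self.kernel_predictor(torch.cat([cond, step], dim=1))
        kernels = kernels.permute(0, 2, 1).reshape(
            B, n_frames, C, C, self.kernel_size)
        # Filter each frame-aligned segment with its own predicted kernel.
        pad = (self.kernel_size - 1) // 2
        xp = F.pad(x, (pad, pad))
        out = torch.empty_like(x)
        for b in range(B):
            for f in range(n_frames):
                s = f * self.hop_size
                seg = xp[b:b + 1, :, s:s + self.hop_size + 2 * pad]
                out[b:b + 1, :, s:s + self.hop_size] = F.conv1d(seg, kernels[b, f])
        return out

lvc = TimeAwareLVC()
x = torch.randn(2, 8, 4 * 256)   # waveform features spanning 4 frames
cond = torch.randn(2, 80, 4)     # 4 mel-spectrogram frames
t_emb = torch.randn(2, 64)       # diffusion-step embedding
y = lvc(x, cond, t_emb)          # -> (2, 8, 1024)

In practice the per-segment loop would be vectorized (e.g., with tensor unfolding), and FastDiff stacks such layers with diverse dilation patterns so the receptive field covers long-range waveform structure.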

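The noise schedule predictor mentioned in the abstract lets inference run over a handful of noise levels rather than the long schedule used in training. The sketch below shows standard DDPM ancestral sampling driven by such a short schedule; the four beta values and the denoiser signature are hypothetical stand-ins, not values from the paper.

import torch

def reverse_sample(denoiser, cond, betas, shape):
    # Standard DDPM ancestral sampling, run over a short noise schedule.
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)  # start from pure Gaussian noise
    for t in reversed(range(len(betas))):
        eps = denoiser(x, cond, t)  # model's noise estimate at level t
        x = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) \
            / torch.sqrt(alphas[t])
        if t > 0:  # no noise is added at the final step
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x

# Hypothetical 4-step schedule, as might be emitted by a schedule predictor,
# replacing the hundreds of steps a vanilla DDPM would take at inference.
betas = torch.tensor([1e-4, 1e-3, 1e-2, 5e-2])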