Paper Title
ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in Text-to-Speech
Paper Authors
Paper Abstract
Expressive text-to-speech (TTS) has recently become a hot research topic, mainly focusing on modeling prosody in speech. Prosody modeling has several challenges: 1) the pitch extracted in previous prosody modeling works has inevitable errors, which hurt prosody modeling; 2) different attributes of prosody (e.g., pitch, duration, and energy) depend on each other and together produce natural prosody; and 3) due to the high variability of prosody and the limited amount of high-quality data for TTS training, the distribution of prosody cannot be fully shaped. To tackle these issues, we propose ProsoSpeech, which enhances prosody using quantized latent vectors pre-trained on large-scale unpaired and low-quality text and speech data. Specifically, we first introduce a word-level prosody encoder, which quantizes the low-frequency band of the speech and compresses the prosody attributes into a latent prosody vector (LPV). Then we introduce an LPV predictor, which predicts the LPV given a word sequence. We pre-train the LPV predictor on large-scale text and low-quality speech data and fine-tune it on a high-quality TTS dataset. Finally, our model can generate expressive speech conditioned on the predicted LPV. Experimental results show that ProsoSpeech can generate speech with richer prosody compared with baseline methods.
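The core of the word-level prosody encoder described above is vector quantization: each word's continuous prosody embedding is snapped to its nearest entry in a learned codebook, yielding the discrete LPV. The following is a minimal NumPy sketch of that nearest-neighbor quantization step only; the codebook size, embedding dimension, and random values are illustrative assumptions, not the paper's actual configuration or training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(128, 16))    # 128 prosody codes, 16-dim (assumed sizes)
word_prosody = rng.normal(size=(5, 16))  # stand-in encoder output for 5 words

def quantize(vectors, codebook):
    """Replace each vector with its nearest codebook entry (L2 distance)."""
    # Broadcast to a (num_words, num_codes) matrix of distances.
    dists = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=-1)
    indices = dists.argmin(axis=1)       # discrete LPV ids, one per word
    return indices, codebook[indices]    # ids and the quantized vectors

lpv_ids, lpv = quantize(word_prosody, codebook)
print(lpv_ids.shape, lpv.shape)  # (5,) (5, 16)
```

In a trainable version (e.g., a VQ-VAE-style encoder), gradients would be passed through the quantization with a straight-through estimator and the codebook updated alongside the encoder; this sketch shows only the inference-time lookup.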