Paper Title
ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in Text-to-Speech
Paper Authors
Paper Abstract
Expressive text-to-speech (TTS) has recently become a hot research topic, mainly focusing on modeling prosody in speech. Prosody modeling has several challenges: 1) the pitch extracted in previous prosody modeling works has inevitable errors, which hurt prosody modeling; 2) different attributes of prosody (e.g., pitch, duration, and energy) depend on each other and together produce natural prosody; and 3) due to the high variability of prosody and the limited amount of high-quality data for TTS training, the distribution of prosody cannot be fully shaped. To tackle these issues, we propose ProsoSpeech, which enhances prosody using quantized latent vectors pre-trained on large-scale unpaired and low-quality text and speech data. Specifically, we first introduce a word-level prosody encoder, which quantizes the low-frequency band of the speech and compresses the prosody attributes into a latent prosody vector (LPV). Then we introduce an LPV predictor, which predicts the LPV given a word sequence. We pre-train the LPV predictor on large-scale text and low-quality speech data and fine-tune it on a high-quality TTS dataset. Finally, our model can generate expressive speech conditioned on the predicted LPV. Experimental results show that ProsoSpeech can generate speech with richer prosody compared with baseline methods.
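The core of the word-level prosody encoder described above is vector quantization: each word's continuous prosody embedding is snapped to its nearest entry in a learned codebook, yielding the discrete LPV. The following is a minimal NumPy sketch of that nearest-neighbor quantization step only; the codebook size, embedding dimension, and random values are illustrative assumptions, not the paper's actual configuration or training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(128, 16))    # 128 prosody codes, 16-dim (assumed sizes)
word_prosody = rng.normal(size=(5, 16))  # stand-in encoder output for 5 words

def quantize(vectors, codebook):
    """Replace each vector with its nearest codebook entry (L2 distance)."""
    # Broadcast to a (num_words, num_codes) matrix of distances.
    dists = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=-1)
    indices = dists.argmin(axis=1)       # discrete LPV ids, one per word
    return indices, codebook[indices]    # ids and the quantized vectors

lpv_ids, lpv = quantize(word_prosody, codebook)
print(lpv_ids.shape, lpv.shape)  # (5,) (5, 16)
```

In a trainable version (e.g., a VQ-VAE-style encoder), gradients would be passed through the quantization with a straight-through estimator and the codebook updated alongside the encoder; this sketch shows only the inference-time lookup.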