Paper Title

Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and auto-regressive prosody prior

Paper Authors

Guangzhi Sun, Yu Zhang, Ron J. Weiss, Yuan Cao, Heiga Zen, Andrew Rosenberg, Bhuvana Ramabhadran, Yonghui Wu

Paper Abstract

Recent neural text-to-speech (TTS) models with fine-grained latent features enable precise control of the prosody of synthesized speech. Such models typically incorporate a fine-grained variational autoencoder (VAE) structure, extracting latent features at each input token (e.g., phonemes). However, generating samples with the standard VAE prior often results in unnatural and discontinuous speech, with dramatic prosodic variation between tokens. This paper proposes a sequential prior in a discrete latent space which can generate more naturally sounding samples. This is accomplished by discretizing the latent features using vector quantization (VQ), and separately training an autoregressive (AR) prior model over the result. We evaluate the approach using listening tests, objective metrics of automatic speech recognition (ASR) performance, and measurements of prosody attributes. Experimental results show that the proposed model significantly improves the naturalness in random sample generation. Furthermore, initial experiments demonstrate that randomly sampling from the proposed model can be used as data augmentation to improve the ASR performance.
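
To make the abstract's two core ideas concrete, the sketch below (not the authors' implementation) shows how per-token latent features can be discretized with vector quantization and how a separately trained autoregressive prior can model the resulting code sequence. The LSTM-based prior, module names, and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Maps each continuous per-token latent to its nearest codebook entry."""
    def __init__(self, num_codes=256, dim=16):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):  # z: [batch, tokens, dim]
        # Distance from every latent vector to every codebook vector.
        book = self.codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1)
        ids = torch.cdist(z, book).argmin(dim=-1)   # discrete code per token
        z_q = self.codebook(ids)                    # quantized latents
        # Straight-through estimator so gradients reach the encoder.
        z_q = z + (z_q - z).detach()
        return z_q, ids

class ARPrior(nn.Module):
    """Autoregressive prior over the discrete code sequence (LSTM assumed)."""
    def __init__(self, num_codes=256, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(num_codes, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, num_codes)

    def forward(self, ids):  # ids: [batch, tokens]
        h, _ = self.lstm(self.embed(ids))
        return self.out(h)   # logits for the next code at each step

# Toy usage: quantize random phoneme-level latents, then train the prior
# to predict code t from codes < t (teacher forcing).
vq, prior = VectorQuantizer(), ARPrior()
z = torch.randn(2, 10, 16)                  # 2 utterances, 10 phoneme-level latents
z_q, ids = vq(z)
logits = prior(ids[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, 256), ids[:, 1:].reshape(-1))
```

At synthesis time, sampling code by code from such a prior (rather than independently from a standard VAE prior at each token) is what yields the smoother, more natural prosody reported in the abstract.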
