Hierarchical Multi-Grained Generative Model for Expressive Speech Synthesis

Paper Title

Hierarchical Multi-Grained Generative Model for Expressive Speech Synthesis

Authors

Yukiya Hono, Kazuna Tsuboi, Kei Sawada, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, Keiichi Tokuda

Abstract

This paper proposes a hierarchical generative model with a multi-grained latent variable for synthesizing expressive speech. In recent years, fine-grained latent variables have been introduced into text-to-speech synthesis, enabling fine control of the prosody and speaking style of synthesized speech. However, the naturalness of the speech degrades when these latent variables are obtained by sampling from a standard Gaussian prior. To solve this problem, we propose a novel framework for modeling the fine-grained latent variables that considers their dependence on the input text, the hierarchical linguistic structure, and the temporal structure of the latent variables. The framework consists of a multi-grained variational autoencoder, a conditional prior, and a multi-level auto-regressive latent converter. Together these obtain latent variables at different time resolutions and sample the finer-level latent variables from the coarser-level ones while taking the input text into account. Experimental results indicate that the framework offers an appropriate method of sampling fine-grained latent variables without a reference signal at the synthesis stage. The proposed framework also provides controllability of the speaking style over an entire utterance.
