Paper Title
Expressive, Variable, and Controllable Duration Modelling in TTS
Paper Authors
Paper Abstract
Duration modelling has become an important research problem once more with the rise of non-attention neural text-to-speech systems. Current approaches largely fall back on duration-prediction techniques from earlier statistical parametric speech synthesis, which model the expressiveness and variability of speech poorly. In this paper, we propose two alternative approaches to improve duration modelling. First, we propose a duration model conditioned on phrasing that improves the predicted durations and provides better modelling of pauses. We show that this phrasing-conditioned duration model improves the naturalness of speech over our baseline duration model. Second, we propose a multi-speaker duration model, called Cauliflow, which uses normalising flows to predict durations that better match the complex target duration distribution. Cauliflow performs on par with our other proposed duration model in terms of naturalness, whilst providing variable durations for the same prompt and variable levels of expressiveness. Lastly, we propose a novel way of conditioning Cauliflow on parameters that provide intuitive control over pacing and pausing in the synthesised speech.
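
To make the flow-based idea concrete, below is a minimal PyTorch sketch of a conditional normalising flow over per-phoneme log-durations, trained by exact maximum likelihood and sampled at inference time. It illustrates the general technique only and is not Cauliflow's architecture (the abstract does not specify the flow steps, conditioning features, or multi-speaker handling); every module name, dimension, and the single affine step are assumptions made for illustration.

# Illustrative sketch only: one conditional affine flow step over per-phoneme
# log-durations. Not Cauliflow itself; all names, dimensions, and
# hyper-parameters below are assumptions.
import math
import torch
import torch.nn as nn


class ConditionalAffineFlow(nn.Module):
    """Single affine flow step: log_dur = mu(cond) + exp(log_sigma(cond)) * z."""

    def __init__(self, cond_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(cond_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 2),  # predicts [mu, log_sigma] for each phoneme
        )

    def forward(self, z, cond):
        # z: (batch, phones) noise; cond: (batch, phones, cond_dim) phoneme features
        mu, log_sigma = self.net(cond).unbind(dim=-1)
        log_dur = mu + torch.exp(log_sigma) * z
        log_det = log_sigma.sum(dim=-1)  # log |det Jacobian| of the forward map
        return log_dur, log_det

    def inverse(self, log_dur, cond):
        mu, log_sigma = self.net(cond).unbind(dim=-1)
        z = (log_dur - mu) * torch.exp(-log_sigma)
        return z, -log_sigma.sum(dim=-1)


def negative_log_likelihood(flow, log_dur, cond):
    """Exact NLL of observed log-durations under the flow (training objective)."""
    z, log_det = flow.inverse(log_dur, cond)
    log_pz = -0.5 * (z ** 2).sum(dim=-1) - 0.5 * z.shape[-1] * math.log(2 * math.pi)
    return -(log_pz + log_det).mean()


# At inference, different noise draws yield different yet plausible durations for
# the same prompt; extra scalar controls (e.g. pacing/pausing knobs) could be
# appended to `cond` as one hypothetical way of exposing intuitive control.
flow = ConditionalAffineFlow(cond_dim=64)
cond = torch.randn(2, 10, 64)   # dummy phoneme-level conditioning features
z = torch.randn(2, 10)          # standard-normal noise
log_dur, _ = flow(z, cond)
durations = torch.exp(log_dur)  # continuous per-phoneme durations (e.g. frames)

Sampling from the noise prior is what makes a flow-based duration model "variable": unlike a regression model that returns one duration per prompt, repeated draws give different, distribution-consistent renditions of the same text.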