Paper Title
Contextual Expressive Text-to-Speech
Paper Authors
Paper Abstract
The goal of expressive Text-to-Speech (TTS) is to synthesize natural speech with the desired content, prosody, emotion, or timbre, with high expressiveness. Most previous studies attempt to generate speech from given labels of styles and emotions, which over-simplifies the problem by classifying styles and emotions into a fixed number of pre-defined categories. In this paper, we introduce a new task setting, Contextual TTS (CTTS). The main idea of CTTS is that how a person speaks depends on the particular context she is in, where the context can typically be represented as text. Thus, in the CTTS task, we propose to utilize such context to guide the speech synthesis process instead of relying on explicit labels of styles and emotions. To achieve this task, we construct a synthetic dataset and develop an effective framework. Experiments show that our framework can generate high-quality expressive speech based on the given context, in both synthetic datasets and real-world scenarios.