Paper Title
Low-resource expressive text-to-speech using data augmentation
Paper Authors
Paper Abstract
While recent neural text-to-speech (TTS) systems perform remarkably well, they typically require a substantial amount of recordings from the target speaker reading in the desired speaking style. In this work, we present a novel 3-step methodology to circumvent the costly operation of recording large amounts of target data in order to build expressive style voices with as little as 15 minutes of such recordings. First, we augment data via voice conversion by leveraging recordings in the desired speaking style from other speakers. Next, we use that synthetic data on top of the available recordings to train a TTS model. Finally, we fine-tune that model to further increase quality. Our evaluations show that the proposed changes bring significant improvements over non-augmented models across many perceived aspects of synthesised speech. We demonstrate the proposed approach on 2 styles (newscaster and conversational), on various speakers, and on both single and multi-speaker models, illustrating the robustness of our approach.
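
To make the abstract's three-step recipe concrete, below is a minimal illustrative Python sketch of the pipeline: convert supporting speakers' style recordings into the target voice, train on the combined real and synthetic data, then fine-tune on the real recordings. Everything here is an assumption for illustration; Utterance, convert_voice, train_tts and fine_tune are hypothetical placeholders, not the paper's actual models or any real library API.

"""Sketch of the 3-step low-resource expressive TTS recipe (illustrative only)."""
from dataclasses import dataclass, replace
from typing import List

@dataclass
class Utterance:
    text: str
    audio: str      # path to a waveform; placeholder for illustration
    speaker: str
    style: str      # e.g. "newscaster" or "conversational"

def convert_voice(src: Utterance, target_speaker: str) -> Utterance:
    """Step 1 (hypothetical): voice conversion keeps the text and speaking
    style but maps the speaker identity onto the target voice."""
    return replace(src, speaker=target_speaker)

def train_tts(corpus: List[Utterance]) -> dict:
    """Step 2 (stub): train a TTS model on real + synthetic data."""
    return {"trained_on": len(corpus)}

def fine_tune(model: dict, corpus: List[Utterance]) -> dict:
    """Step 3 (stub): fine-tune on the target speaker's real recordings."""
    model["fine_tuned_on"] = len(corpus)
    return model

# ~15 minutes of real target-speaker recordings in the desired style
target_real = [Utterance("Hello.", "t_0.wav", "target", "newscaster")]
# Larger corpus of supporting speakers recorded in that same style
support = [Utterance("News item.", f"s_{i}.wav", f"spk{i}", "newscaster")
           for i in range(3)]

synthetic = [convert_voice(u, "target") for u in support]   # step 1
model = train_tts(target_real + synthetic)                  # step 2
model = fine_tune(model, target_real)                       # step 3
print(model)  # {'trained_on': 4, 'fine_tuned_on': 1}

In practice each stub would be a trained neural model; the sketch only shows how the synthetic data augments the scarce target recordings before the final fine-tuning pass.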