Cross-speaker style transfer for text-to-speech using data augmentation

Paper title

Cross-speaker style transfer for text-to-speech using data augmentation

Authors

Ribeiro, Manuel Sam; Roth, Julian; Comini, Giulia; Huybrechts, Goeric; Gabrys, Adam; Lorenzo-Trueba, Jaime

Abstract

We address the problem of cross-speaker style transfer for text-to-speech (TTS) using data augmentation via voice conversion. We assume to have a corpus of neutral non-expressive data from a target speaker and supporting conversational expressive data from different speakers. Our goal is to build a TTS system that is expressive, while retaining the target speaker's identity. The proposed approach relies on voice conversion to first generate high-quality data from the set of supporting expressive speakers. The voice converted data is then pooled with natural data from the target speaker and used to train a single-speaker multi-style TTS system. We provide evidence that this approach is efficient, flexible, and scalable. The method is evaluated using one or more supporting speakers, as well as a variable amount of supporting data. We further provide evidence that this approach allows some controllability of speaking style, when using multiple supporting speakers. We conclude by scaling our proposed technology to a set of 14 speakers across 7 languages. Results indicate that our technology consistently improves synthetic samples in terms of style similarity, while retaining the target speaker's identity.
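
The data flow described in the abstract (convert the supporting speakers' expressive data to the target voice, then pool it with the target speaker's natural data) can be illustrated with a minimal sketch. This is not the authors' implementation: `Utterance`, `convert_voice`, and `build_training_set` are hypothetical names, and `convert_voice` is only a placeholder that relabels the utterance where a real voice conversion model would synthesize new audio.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    text: str
    audio_path: str
    speaker: str  # identity heard in the audio
    style: str    # e.g. "neutral" or "conversational"


def convert_voice(utt: Utterance, target_speaker: str) -> Utterance:
    """Placeholder for a voice conversion model (assumption, not a real API).

    A real system would render new audio in the target voice; here only the
    metadata changes: the speaker becomes the target, the style is kept.
    """
    return Utterance(
        text=utt.text,
        audio_path=f"vc/{target_speaker}/{utt.audio_path}",
        speaker=target_speaker,
        style=utt.style,
    )


def build_training_set(target_data, supporting_data, target_speaker):
    """Pool natural target data with voice-converted supporting data."""
    converted = [convert_voice(u, target_speaker) for u in supporting_data]
    return list(target_data) + converted


# Toy example: one neutral target utterance, one expressive supporting one.
target = [Utterance("Hello.", "t_001.wav", "target", "neutral")]
support = [Utterance("No way!", "s_001.wav", "support_a", "conversational")]
pooled = build_training_set(target, support, "target")
```

In this sketch the pooled set contains a single speaker identity but more than one style label, which is the property a single-speaker multi-style TTS model would be trained on.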
