TDASS: Target Domain Adaptation Speech Synthesis Framework for Multi-speaker Low-Resource TTS

Paper title

TDASS: Target Domain Adaptation Speech Synthesis Framework for Multi-speaker Low-Resource TTS

Authors

Zhang, Xulong; Wang, Jianzong; Cheng, Ning; Xiao, Jing

Abstract

Synthesizing personalized speech with text-to-speech (TTS) applications is in high demand. However, previous TTS models require a large amount of speech from the target speaker for training, and recording many utterances from a target speaker is costly and difficult. Data augmentation of the speech is one solution, but it leads to low-quality synthesized speech. Some multi-speaker TTS models have been proposed to address this issue, but the imbalance in the number of utterances per speaker leads to a voice similarity problem. We propose the Target Domain Adaptation Speech Synthesis Network (TDASS) to address these issues. Built on the backbone of Tacotron2, a high-quality TTS model, TDASS introduces a self-interested classifier to reduce the influence of non-target speakers. In addition, a special gradient reversal layer with different operations for target and non-target speakers is added to the classifier. We evaluate the model on a Chinese speech corpus, and the experiments show that the proposed method outperforms the baseline method in terms of voice quality and voice similarity.
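The abstract mentions a gradient reversal layer but does not describe how its behavior differs for target and non-target speakers. As a rough illustration only, the sketch below shows the standard form of such a layer: it passes features through unchanged in the forward pass and negates (and scales) the gradient in the backward pass. The class name `GradientReversal`, the coefficient `lam`, and the sample values are assumptions made for this example, not details taken from the paper.

```python
class GradientReversal:
    """Standard gradient reversal layer (illustrative, framework-free).

    Assumption: this is the generic layer, not the TDASS-specific variant,
    whose target/non-target operations are not given in the abstract.
    """

    def __init__(self, lam=1.0):
        # lam is a hypothetical scaling coefficient for the reversed gradient.
        self.lam = lam

    def forward(self, features):
        # Forward pass is the identity: features reach the classifier unchanged.
        return list(features)

    def backward(self, grad_output):
        # Backward pass flips the sign of the incoming gradient and scales it,
        # so layers before this one are pushed to increase the classifier loss.
        return [-self.lam * g for g in grad_output]


# Made-up values, used only to show the behavior.
grl = GradientReversal(lam=0.5)
features = [0.5, -1.0, 2.0]
forward_out = grl.forward(features)
reversed_grad = grl.backward([1.0, 2.0, -4.0])
print(forward_out)
print(reversed_grad)
```

In a deep learning framework, the same idea would typically be written as a custom autograd operation placed between the shared encoder and the classifier.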
