Paper title
Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation
Paper authors
Abstract
Recently, end-to-end multi-speaker text-to-speech (TTS) systems have gained success in settings where a large amount of high-quality speech and corresponding transcriptions is available. However, the laborious paired-data collection process prevents many institutes from building high-performance multi-speaker TTS systems. In this work, we propose a semi-supervised learning approach for multi-speaker TTS. A multi-speaker TTS model can learn from untranscribed audio via the proposed encoder-decoder framework with discrete speech representation. Experimental results demonstrate that with only one hour of paired speech data, whether the paired data comes from multiple speakers or a single speaker, the proposed model can generate intelligible speech in different voices. We find that the model benefits from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy. In addition, our analysis reveals that the speaker characteristics of the paired data affect the effectiveness of semi-supervised TTS.
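The core of the framework described above is a discrete speech representation: encoder outputs are mapped to entries of a learned codebook, and the decoder conditions on these discrete units, which is what lets untranscribed audio contribute to training. A minimal sketch of the quantization step is shown below; the function name, shapes, and random codebook are illustrative assumptions, not the paper's actual model.

```python
import numpy as np

def quantize(frames, codebook):
    """Map each encoder frame to the index of its nearest codebook entry,
    yielding one discrete unit per frame (hypothetical helper, not the
    paper's implementation)."""
    # frames: (T, d) encoder outputs; codebook: (K, d) learned code vectors
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (T, K)
    return dists.argmin(axis=1)  # (T,) discrete unit indices

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))   # K=8 discrete units, d=4 dims (assumed sizes)
frames = rng.normal(size=(10, 4))    # 10 encoder frames of a speech segment
units = quantize(frames, codebook)
print(units.shape)                   # one unit index per frame
```

In a semi-supervised setup of this kind, unpaired audio can supervise an audio-to-unit-to-audio reconstruction path, while the small paired set ties text to the same discrete units; the sketch above covers only the quantization step.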