Paper Title

ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech

Paper Authors

Xiaoran Fan, Chao Pang, Tian Yuan, He Bai, Renjie Zheng, Pengfei Zhu, Shuohuan Wang, Junkun Chen, Zeyu Chen, Liang Huang, Yu Sun, Hua Wu

Paper Abstract

Speech representation learning has improved both speech understanding and speech synthesis tasks for a single language. However, its ability in cross-lingual scenarios has not been explored. In this paper, we extend the pretraining method to cross-lingual multi-speaker speech synthesis tasks, including cross-lingual multi-speaker voice cloning and cross-lingual multi-speaker speech editing. We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes given a speech example and its transcription. By learning to reconstruct the masked parts of the input in different languages, our model shows great improvements over speaker-embedding-based multi-speaker TTS methods. Moreover, our framework is end-to-end for both training and inference, without any finetuning effort. In cross-lingual multi-speaker voice cloning and cross-lingual multi-speaker speech editing tasks, our experiments show that our model outperforms speaker-embedding-based multi-speaker TTS methods.
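
As a rough illustration of the masked joint pretraining objective described in the abstract, the sketch below randomly masks phoneme tokens and mel-spectrogram frames and trains a toy encoder to reconstruct the masked frames. This is not the authors' ERNIE-SAT implementation; the model architecture, the 0.15 mask ratio, and the tensor shapes are assumptions chosen purely for illustration.

```python
# Minimal sketch of a masked speech-text reconstruction objective.
# NOT the ERNIE-SAT implementation; all sizes and ratios are assumed.
import torch
import torch.nn as nn


class ToyJointEncoder(nn.Module):
    """A stand-in encoder over concatenated phoneme and spectrogram inputs."""

    def __init__(self, n_phonemes=100, n_mels=80, d_model=256):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phonemes + 1, d_model)  # last index = [MASK]
        self.spec_proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.spec_out = nn.Linear(d_model, n_mels)  # decode hidden states to mel frames

    def forward(self, phones, spec):
        # Concatenate phoneme embeddings and projected spectrogram frames along time.
        x = torch.cat([self.phone_emb(phones), self.spec_proj(spec)], dim=1)
        h = self.encoder(x)
        # Only the spectrogram positions are decoded back to mel frames here.
        return self.spec_out(h[:, phones.size(1):])


def masked_reconstruction_loss(model, phones, spec, mask_ratio=0.15):
    """Randomly mask phoneme tokens and spectrogram frames, then score
    reconstruction of the masked spectrogram frames only."""
    phone_mask = torch.rand(phones.shape) < mask_ratio
    spec_mask = torch.rand(spec.shape[:2]) < mask_ratio
    mask_id = model.phone_emb.num_embeddings - 1
    masked_phones = phones.masked_fill(phone_mask, mask_id)
    masked_spec = spec.masked_fill(spec_mask.unsqueeze(-1), 0.0)
    pred = model(masked_phones, masked_spec)
    # L1 reconstruction loss restricted to the masked frames.
    return (pred - spec).abs()[spec_mask].mean()


if __name__ == "__main__":
    model = ToyJointEncoder()
    phones = torch.randint(0, 100, (2, 20))   # batch of phoneme IDs
    spec = torch.randn(2, 120, 80)            # batch of mel-spectrograms
    loss = masked_reconstruction_loss(model, phones, spec)
    loss.backward()
    print(f"masked reconstruction loss: {loss.item():.4f}")
```

For brevity, only the masked spectrogram frames are scored in this sketch; the objective described in the abstract reconstructs the masked parts of both the speech and the text input.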
