Paper Title

MultiSpeech: Multi-Speaker Text to Speech with Transformer

Paper Authors

Mingjian Chen, Xu Tan, Yi Ren, Jin Xu, Hao Sun, Sheng Zhao, Tao Qin, Tie-Yan Liu

Paper Abstract

Transformer-based text to speech (TTS) models (e.g., Transformer TTS~\cite{li2019neural}, FastSpeech~\cite{ren2019fastspeech}) have shown advantages in training and inference efficiency over RNN-based models (e.g., Tacotron~\cite{shen2018natural}) due to their parallel computation in training and/or inference. However, parallel computation increases the difficulty of learning the alignment between text and speech in Transformer, which is further magnified in the multi-speaker scenario with noisy data and diverse speakers, and hinders the applicability of Transformer to multi-speaker TTS. In this paper, we develop a robust and high-quality multi-speaker Transformer TTS system called MultiSpeech, with several specially designed components/techniques to improve text-to-speech alignment: 1) a diagonal constraint on the weight matrix of encoder-decoder attention in both training and inference; 2) layer normalization on phoneme embedding in the encoder to better preserve position information; 3) a bottleneck in the decoder pre-net to prevent copying between consecutive speech frames. Experiments on the VCTK and LibriTTS multi-speaker datasets demonstrate the effectiveness of MultiSpeech: 1) it synthesizes more robust and better-quality multi-speaker voice than naive Transformer-based TTS; 2) with a MultiSpeech model as the teacher, we obtain a strong multi-speaker FastSpeech model with almost zero quality degradation while enjoying extremely fast inference speed.
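As a concrete illustration of two of the techniques named in the abstract, here is a minimal PyTorch sketch: a guided-attention-style diagonal penalty on the encoder-decoder attention weights, and a narrow decoder pre-net. This is a sketch under assumptions, not the paper's implementation; the exact form of MultiSpeech's diagonal constraint may differ, and all names, layer sizes, and the bandwidth g below are assumed for illustration only.

```python
import torch
from torch import nn


def diagonal_attention_penalty(attn: torch.Tensor, g: float = 0.2) -> torch.Tensor:
    """Penalize encoder-decoder attention mass that falls far from the diagonal.

    attn: (batch, T_dec, T_enc) attention weights, rows softmax-normalized.
    g: assumed bandwidth of the diagonal band (not a value from the paper).
    """
    _, t_dec, t_enc = attn.shape
    n = torch.arange(t_dec, device=attn.device, dtype=attn.dtype) / t_dec  # decoder positions in [0, 1)
    t = torch.arange(t_enc, device=attn.device, dtype=attn.dtype) / t_enc  # encoder positions in [0, 1)
    # Weight is ~0 near the diagonal and approaches 1 away from it.
    w = 1.0 - torch.exp(-((n[:, None] - t[None, :]) ** 2) / (2.0 * g * g))
    return (attn * w.unsqueeze(0)).mean()


class BottleneckPreNet(nn.Module):
    """Decoder pre-net with a narrow hidden size plus dropout, so the decoder
    cannot trivially copy the previous mel frame (sizes are assumed)."""

    def __init__(self, mel_dim: int = 80, bottleneck_dim: int = 32, dropout: float = 0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(mel_dim, bottleneck_dim), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(bottleneck_dim, bottleneck_dim), nn.ReLU(), nn.Dropout(dropout),
        )

    def forward(self, mel_frames: torch.Tensor) -> torch.Tensor:
        return self.net(mel_frames)


# Usage sketch: add the penalty to the training loss alongside the mel loss.
attn = torch.softmax(torch.randn(2, 100, 40), dim=-1)  # stand-in attention weights
loss_diag = diagonal_attention_penalty(attn)
prenet_out = BottleneckPreNet()(torch.randn(2, 100, 80))
```

In practice the penalty would be averaged over whichever attention heads/layers are chosen to carry alignment, and its weight relative to the mel reconstruction loss is a tunable hyperparameter.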
