Paper title
NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality
Paper authors
Abstract
Text to speech (TTS) has made rapid progress in both academia and industry in recent years. Several questions naturally arise: whether a TTS system can achieve human-level quality, how to define and judge that quality, and how to achieve it. In this paper, we answer these questions by first defining human-level quality based on the statistical significance of a subjective measure and introducing appropriate guidelines to judge it, and then developing a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset. Specifically, we leverage a variational autoencoder (VAE) for end-to-end text-to-waveform generation, with several key modules to enhance the capacity of the prior from text and reduce the complexity of the posterior from speech, including phoneme pre-training, differentiable duration modeling, bidirectional prior/posterior modeling, and a memory mechanism in the VAE. Experimental evaluations on the popular LJSpeech dataset show that our proposed NaturalSpeech achieves -0.01 CMOS (comparative mean opinion score) relative to human recordings at the sentence level, with a Wilcoxon signed rank test at p-level p >> 0.05, which demonstrates no statistically significant difference from human recordings for the first time on this dataset.
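The human-level criterion in the abstract rests on two quantities: a CMOS close to zero and a Wilcoxon signed rank test whose p-value is well above 0.05. The following is a minimal sketch of how such a check could be computed from per-sentence comparative ratings. It is not the authors' evaluation code. The function names and the ratings are hypothetical, and the p-value uses the normal approximation without continuity correction, which is an assumption here; a real study would more likely use an exact test or an established routine such as `scipy.stats.wilcoxon`.

```python
import math
from statistics import mean


def cmos(scores):
    """Comparative mean opinion score: mean of per-sentence ratings,
    each rating being (system minus human recording)."""
    return mean(scores)


def wilcoxon_signed_rank(diffs):
    """Two-sided Wilcoxon signed rank test (normal approximation, assumed here).

    Zero differences are dropped and tied absolute values share an average rank.
    Returns (W, p) where W is the smaller of the positive and negative rank sums.
    """
    nonzero = [d for d in diffs if d != 0]
    n = len(nonzero)
    if n == 0:
        return 0.0, 1.0

    # Rank the absolute differences, averaging ranks within tie groups.
    order = sorted(range(n), key=lambda i: abs(nonzero[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(nonzero[order[j + 1]]) == abs(nonzero[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, nonzero) if d < 0)
    w = min(w_plus, w_minus)

    # Mean and tie-corrected variance of W under the null hypothesis.
    mu = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w - mu) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return w, min(1.0, p)


if __name__ == "__main__":
    # Hypothetical ratings on a -3..+3 comparative scale, for illustration only.
    ratings = [-1, 1, -2, 2, 0, 0, -1, 1]
    w, p = wilcoxon_signed_rank(ratings)
    print(f"CMOS = {cmos(ratings):+.2f}, W = {w}, p = {p:.3f}")
```

Under this reading, a system would count as indistinguishable from recordings when the CMOS is near zero and the test fails to reject the null hypothesis, that is, when p is well above 0.05.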