Paper Title
Text-to-speech synthesis from dark data with evaluation-in-the-loop data selection
Paper Authors
Paper Abstract
This paper proposes a method for selecting training data for text-to-speech (TTS) synthesis from dark data. TTS models are typically trained on high-quality speech corpora that require considerable time and money to collect, which makes it very challenging to increase speaker variation. In contrast, there is a large amount of data whose availability is unknown (a.k.a. "dark data"), such as YouTube videos. To utilize data other than TTS corpora, previous studies have selected speech data from such corpora on the basis of acoustic quality. However, considering that TTS models robust to data noise have been proposed, we should select data on the basis of its importance as training data for the given TTS model, not the quality of the speech itself. Our method, with a loop of training and evaluation, selects training data on the basis of the automatically predicted quality of speech synthesized by the given TTS model. Results of evaluations using YouTube data reveal that our method outperforms the conventional acoustic-quality-based method.
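The abstract describes the selection loop only at a high level. The following is a minimal, hypothetical Python sketch of the idea: train a TTS model on each candidate subset of the dark data, synthesize held-out texts, score the synthetic speech with an automatic quality predictor (e.g., a pseudo-MOS model), and keep the subsets with the highest predicted scores instead of ranking the natural recordings by acoustic quality. The helper names (`train_fn`, `synth_fn`, `quality_fn`), the per-subset granularity, and the `top_k` selection rule are assumptions for illustration, not the paper's implementation.

```python
# A minimal sketch of evaluation-in-the-loop data selection, under assumed
# interfaces: train_fn trains a TTS model on a candidate subset, synth_fn
# synthesizes speech for a held-out text, and quality_fn predicts the quality
# of a waveform (e.g., an automatic MOS predictor). These names are
# illustrative only and do not come from the paper.
from typing import Callable, Dict, List


def select_by_synthetic_quality(
    candidates: Dict[str, List[str]],            # subset id (e.g., speaker) -> utterance ids
    train_fn: Callable[[List[str]], object],     # candidate utterances -> trained TTS model
    synth_fn: Callable[[object, str], object],   # (model, text) -> synthetic waveform
    quality_fn: Callable[[object], float],       # waveform -> predicted quality score
    eval_texts: List[str],
    top_k: int,
) -> List[str]:
    """Keep the candidate subsets whose *synthetic* speech is predicted to be
    best, rather than ranking the natural recordings by acoustic quality."""
    scored = []
    for subset_id, utterances in candidates.items():
        model = train_fn(utterances)             # train (or fine-tune) on this subset
        preds = [quality_fn(synth_fn(model, t)) for t in eval_texts]
        scored.append((sum(preds) / len(preds), subset_id))
    scored.sort(reverse=True)                    # highest predicted quality first
    kept = [sid for _, sid in scored[:top_k]]
    return [utt for sid in kept for utt in candidates[sid]]


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end; a real setup would plug in
    # an actual TTS trainer and a learned quality predictor.
    import random

    dummy_candidates = {f"spk{i}": [f"spk{i}_utt{j}" for j in range(5)] for i in range(4)}
    chosen = select_by_synthetic_quality(
        dummy_candidates,
        train_fn=lambda utts: {"data": utts},
        synth_fn=lambda model, text: f"wav({text})",
        quality_fn=lambda wav: random.uniform(1.0, 5.0),
        eval_texts=["hello world", "text to speech"],
        top_k=2,
    )
    print(chosen)
```

The key design point this sketch tries to capture is that the score is attached to the data through the synthesized output of a model trained on it, so noisy-but-useful recordings are not discarded merely because their raw acoustic quality is low.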