Paper Title
DenoiSpeech: Denoising Text to Speech with Frame-Level Noise Modeling
Paper Authors
Paper Abstract
While neural-based text to speech (TTS) models can synthesize natural and intelligible voice, they usually require high-quality speech data, which is costly to collect. In many scenarios, only noisy speech of a target speaker is available, which presents challenges for TTS model training for this speaker. Previous works usually address the challenge with one of two methods: 1) training the TTS model on speech denoised with an enhancement model; 2) taking a single noise embedding as input when training on noisy speech. However, they usually cannot handle speech with complicated real-world noise, such as noise that varies greatly over time. In this paper, we develop DenoiSpeech, a TTS system that can synthesize clean speech for a speaker with noisy speech data. In DenoiSpeech, we handle real-world noisy speech by modeling the fine-grained frame-level noise with a noise condition module, which is jointly trained with the TTS model. Experimental results on real-world data show that DenoiSpeech outperforms the previous two methods by 0.31 and 0.66 MOS respectively.
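The key distinction the abstract draws, between a single utterance-level noise embedding and frame-level noise conditioning, can be sketched in plain NumPy. This is a minimal illustration only; the array names, sizes, and the additive conditioning are assumptions for exposition, not the paper's actual noise condition module:

```python
import numpy as np

rng = np.random.default_rng(0)

T, H = 100, 8  # number of frames, hidden size (illustrative values)

# Stand-in for per-frame hidden states from a TTS text encoder.
hidden = rng.standard_normal((T, H))

# 1) Utterance-level baseline: one noise embedding for the whole clip,
#    broadcast identically onto every frame.
utt_noise = rng.standard_normal(H)
cond_utt = hidden + utt_noise  # (H,) broadcasts across all T frames

# 2) Frame-level conditioning: a distinct noise embedding per frame,
#    so the condition can track noise that changes over time.
frame_noise = rng.standard_normal((T, H))
cond_frame = hidden + frame_noise

# The utterance-level condition adds the same vector to every frame...
assert np.allclose(cond_utt - hidden, np.tile(utt_noise, (T, 1)))
# ...whereas the frame-level condition differs from frame to frame.
print(cond_frame.shape)  # (100, 8)
```

The broadcast in the first case is exactly why a single embedding cannot represent noise with high variation along time: every frame receives the identical correction, regardless of how the noise actually evolves within the utterance.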