Title

Preserving background sound in noise-robust voice conversion via multi-task learning

Authors

Jixun Yao, Yi Lei, Qing Wang, Pengcheng Guo, Ziqian Ning, Lei Xie, Hai Li, Junhui Liu, Danming Xie

Abstract

Background sound is an informative form of art that helps provide a more immersive experience in real-application voice conversion (VC) scenarios. However, prior research on VC has mainly focused on clean voices and paid little attention to VC with background sound. The critical problems for preserving background sound in VC are the speech distortion inevitably introduced by the neural separation model and the mismatch between the cascaded source separation model and the VC model. In this paper, we propose an end-to-end framework via multi-task learning which sequentially cascades a source separation (SS) module, a bottleneck feature extraction module and a VC module. Specifically, the source separation task explicitly considers critical phase information and confines the distortion caused by the imperfect separation process. The source separation task, the typical VC task and the unified task share a uniform reconstruction loss constrained by joint training to reduce the mismatch between the SS and VC modules. Experimental results demonstrate that our proposed framework significantly outperforms the baseline systems while achieving quality and speaker similarity comparable to VC models trained on clean data.
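The abstract describes joint training in which the SS task, the VC task and the unified task are constrained by a shared reconstruction loss. As a minimal sketch of how such a multi-task objective could be combined, the snippet below uses a weighted sum; the weights, function names and plain-Python formulation are assumptions for illustration, not the paper's actual implementation.

```python
def joint_loss(ss_loss: float, vc_loss: float, unified_recon_loss: float,
               w_ss: float = 1.0, w_vc: float = 1.0, w_unified: float = 1.0) -> float:
    """Combine the three task losses into one joint training objective.

    A weighted sum is a common multi-task formulation; the equal default
    weights here are a hypothetical choice, not taken from the paper.
    """
    return w_ss * ss_loss + w_vc * vc_loss + w_unified * unified_recon_loss


# Example: combine per-batch losses from the SS, VC and unified tasks.
total = joint_loss(0.4, 0.6, 0.2)
```

In a real training loop, `total` would be backpropagated through all three modules at once, which is what couples the SS and VC modules and reduces the cascade mismatch the abstract refers to.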
