Paper Title
WeSinger: Data-augmented Singing Voice Synthesis with Auxiliary Losses
Paper Authors
Paper Abstract
In this paper, we develop a new multi-singer Chinese neural singing voice synthesis (SVS) system named WeSinger. To improve the accuracy and naturalness of the synthesized singing voice, we design several specific modules and techniques: 1) a deep bi-directional LSTM-based duration model with a multi-scale rhythm loss and a post-processing step; 2) a Transformer-like acoustic model with a progressive pitch-weighted decoder loss; 3) a 24 kHz pitch-aware LPCNet neural vocoder to produce high-quality singing waveforms; 4) a novel data augmentation method with multi-singer pre-training for stronger robustness and naturalness. To our knowledge, WeSinger is the first SVS system to adopt 24 kHz LPCNet and multi-singer pre-training simultaneously. Both quantitative and qualitative evaluation results demonstrate the effectiveness of WeSinger in terms of accuracy and naturalness, and WeSinger achieves state-of-the-art performance on the recent public Chinese singing corpus Opencpop\footnote{https://wenet.org.cn/opencpop/}. Some synthesized singing samples are available online\footnote{https://zzw922cn.github.io/wesinger/}.
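The abstract does not define the pitch-weighted decoder loss, but its name suggests reweighting the frame-level reconstruction error by pitch so that high-pitch (typically perceptually salient) frames are emphasized. Below is a minimal NumPy sketch of one such weighting scheme; the function name, the `alpha` parameter, and the per-utterance pitch normalization are illustrative assumptions, not details from the paper.

```python
import numpy as np

def pitch_weighted_loss(pred, target, pitch, alpha=1.0):
    """Hypothetical pitch-weighted L1 loss over acoustic frames.

    pred, target: (T, D) arrays of predicted / reference acoustic features.
    pitch: (T,) array of frame-level pitch values (0 for unvoiced frames).
    alpha: how strongly pitch increases a frame's weight (assumed knob).
    """
    p = np.asarray(pitch, dtype=np.float64)
    # Normalize pitch to [0, 1] per utterance; guard against all-zero pitch.
    denom = p.max() if p.max() > 0 else 1.0
    w = 1.0 + alpha * (p / denom)            # every frame keeps weight >= 1
    per_frame = np.abs(pred - target).mean(axis=-1)  # L1 error per frame
    return float((w * per_frame).mean())
```

With `alpha=0` this reduces to a plain frame-averaged L1 loss, so the weighting can be phased in progressively over training, which may be what "progressive" refers to.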