Paper title
Semi-supervised Learning for Singing Synthesis Timbre
Paper authors
Paper abstract
We propose a semi-supervised singing synthesizer, which is able to learn new voices from audio data only, without any annotations such as phonetic segmentation. Our system is an encoder-decoder model with two encoders, linguistic and acoustic, and one (acoustic) decoder. In a first step, the system is trained in a supervised manner, using a labelled multi-singer dataset. Here, we ensure that the embeddings produced by both encoders are similar, so that we can later use the model with either acoustic or linguistic input features. To learn a new voice in an unsupervised manner, the pretrained acoustic encoder is used to train a decoder for the target singer. Finally, at inference, the pretrained linguistic encoder is used together with the decoder of the new voice, to produce acoustic features from linguistic input. We evaluate our system with a listening test and show that the results are comparable to those obtained with an equivalent supervised approach.
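The three stages described in the abstract (supervised pretraining with matched embeddings, unsupervised fitting of a new decoder, and inference from linguistic input) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: plain linear maps stand in for the neural encoders and decoder, the loss is a sum of two reconstruction terms plus an embedding-similarity term weighted by a hypothetical `lam`, and all function names and dimensions are invented. The abstract does not specify the actual architecture or loss.

```python
import numpy as np


def train_supervised(x_ling, x_ac, dim=3, lam=1.0, lr=0.01, steps=500, seed=0):
    """Stage 1 (assumed form): jointly fit both encoders and one decoder
    on paired linguistic/acoustic features by gradient descent."""
    rng = np.random.default_rng(seed)
    n = len(x_ling)
    w_ling = 0.1 * rng.standard_normal((x_ling.shape[1], dim))  # linguistic encoder
    w_ac = 0.1 * rng.standard_normal((x_ac.shape[1], dim))      # acoustic encoder
    w_dec = 0.1 * rng.standard_normal((dim, x_ac.shape[1]))     # acoustic decoder
    for _ in range(steps):
        z_ling, z_ac = x_ling @ w_ling, x_ac @ w_ac
        r_ling = z_ling @ w_dec - x_ac  # reconstruction error via linguistic encoder
        r_ac = z_ac @ w_dec - x_ac      # reconstruction error via acoustic encoder
        gap = z_ling - z_ac             # mismatch between the two embeddings
        # Gradients of mean(|r_ling|^2) + mean(|r_ac|^2) + lam * mean(|gap|^2)
        g_dec = (z_ling.T @ r_ling + z_ac.T @ r_ac) * (2 / n)
        g_ling = x_ling.T @ (r_ling @ w_dec.T + lam * gap) * (2 / n)
        g_ac = x_ac.T @ (r_ac @ w_dec.T - lam * gap) * (2 / n)
        w_dec -= lr * g_dec
        w_ling -= lr * g_ling
        w_ac -= lr * g_ac
    return w_ling, w_ac, w_dec


def fit_new_voice_decoder(w_ac, y_ac):
    """Stage 2: keep the pretrained acoustic encoder frozen and fit a new
    decoder on unlabelled audio features of the target singer.
    Least squares replaces iterative training in this linear toy."""
    z = y_ac @ w_ac
    w_new, *_ = np.linalg.lstsq(z, y_ac, rcond=None)
    return w_new


def synthesize(w_ling, w_new, x_ling):
    """Stage 3: pretrained linguistic encoder followed by the new decoder."""
    return x_ling @ w_ling @ w_new


# Toy data: acoustic features are a fixed linear function of linguistic ones.
rng = np.random.default_rng(1)
x_ling = rng.standard_normal((200, 4))
x_ac = x_ling @ (0.3 * rng.standard_normal((4, 6)))         # labelled multi-singer set
target_ac = rng.standard_normal((80, 4)) @ (0.3 * rng.standard_normal((4, 6)))

w_ling, w_ac, w_dec = train_supervised(x_ling, x_ac)
w_new = fit_new_voice_decoder(w_ac, target_ac)               # audio only, no labels
features = synthesize(w_ling, w_new, x_ling[:5])
print(features.shape)
```

The point of the similarity term is that the new decoder is only ever trained on acoustic-encoder embeddings, so it can be driven by the linguistic encoder at inference only if both encoders map to roughly the same embedding space.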