Karaoker: Alignment-free singing voice synthesis with speech training data

Paper title

Karaoker: Alignment-free singing voice synthesis with speech training data

Authors

Panos Kakoulidis, Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, June Sig Sung, Gunu Jho, Pirros Tsiakoulis, Aimilios Chalamandaris

Abstract

Existing singing voice synthesis (SVS) models are usually trained on singing data and depend on either error-prone time-alignment and duration features or explicit music score information. In this paper, we propose Karaoker, a multispeaker Tacotron-based model conditioned on voice characteristic features that is trained exclusively on spoken data without requiring time alignments. Karaoker synthesizes singing voice and transfers style by following a multi-dimensional template extracted from a source waveform of an unseen singer/speaker. The model is jointly conditioned with a single deep convolutional encoder on continuous data including pitch, intensity, harmonicity, formants, cepstral peak prominence and octaves. We extend the text-to-speech training objective with feature reconstruction, classification and speaker identification tasks that guide the model to an accurate result. In addition to multitasking, we also employ a Wasserstein GAN training scheme as well as new losses on the acoustic model's output to further refine the quality of the model.
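The abstract mentions a Wasserstein GAN training scheme without giving details. As a rough illustration only, the basic Wasserstein objective can be sketched as follows. This is not the authors' implementation: the function names and the example scores are hypothetical, and the paper's exact variant (for example, how the critic is constrained) is not stated here.

```python
from statistics import fmean


def critic_loss(real_scores, fake_scores):
    """Basic Wasserstein critic loss.

    Minimizing this pushes the critic to score real samples
    higher than generated ones.
    """
    return fmean(fake_scores) - fmean(real_scores)


def generator_loss(fake_scores):
    """Basic Wasserstein generator loss.

    Minimizing this pushes the generator to produce samples
    that the critic scores highly.
    """
    return -fmean(fake_scores)


# Hypothetical critic scores for a batch of real and synthesized
# acoustic-model outputs (values are made up for illustration).
real_scores = [0.9, 1.1, 1.0]
fake_scores = [-0.5, -0.3, -0.1]

print("critic loss:", critic_loss(real_scores, fake_scores))
print("generator loss:", generator_loss(fake_scores))
```

In a multitask setup such as the one described, a term like this would presumably be added to the text-to-speech and auxiliary task losses; how the terms are weighted is not given in the abstract.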

Paper title