Paper Title


Adversarial Speaker-Consistency Learning Using Untranscribed Speech Data for Zero-Shot Multi-Speaker Text-to-Speech

Authors

Byoung Jin Choi, Myeonghun Jeong, Minchan Kim, Sung Hwan Mun, Nam Soo Kim

Abstract

Several recently proposed text-to-speech (TTS) models have achieved human-level quality in single-speaker and multi-speaker TTS scenarios with a set of pre-defined speakers. However, synthesizing a new speaker's voice from a single reference audio, commonly known as zero-shot multi-speaker text-to-speech (ZSM-TTS), remains a very challenging task. The main challenge of ZSM-TTS is the speaker domain shift problem that arises when generating speech for a new speaker. To mitigate this problem, we propose adversarial speaker-consistency learning (ASCL). The proposed method first generates an additional speech sample for a query speaker drawn from an external untranscribed dataset at each training iteration. Then, the model learns to consistently generate speech samples of the same speaker as the corresponding speaker embedding vector by employing an adversarial learning scheme. Experimental results show that the proposed method outperforms the baseline in terms of quality and speaker similarity in ZSM-TTS.
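The abstract describes an adversarial scheme in which the model is pushed to generate speech that is consistent with a given speaker embedding. The paper does not specify the discriminator architecture or loss, so the following is only a minimal PyTorch sketch of that general idea, assuming a pooled acoustic feature vector, a fixed-size speaker embedding, and a hinge-style GAN objective; all class names, dimensions, and the loss choice are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SpeakerConsistencyDiscriminator(nn.Module):
    """Hypothetical discriminator: scores whether a speech representation
    and a speaker embedding belong to the same speaker."""

    def __init__(self, speech_dim=80, spk_dim=256, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(speech_dim + spk_dim, hidden),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden, 1),
        )

    def forward(self, speech_feat, spk_emb):
        # speech_feat: (B, speech_dim) pooled acoustic features
        # spk_emb:     (B, spk_dim) embedding of the query speaker
        return self.net(torch.cat([speech_feat, spk_emb], dim=-1))

def ascl_discriminator_loss(disc, real_feat, fake_feat, spk_emb):
    """Hinge GAN loss (an assumption, the paper does not name its loss):
    real (speech, embedding) pairs should score high, generated pairs low."""
    real_score = disc(real_feat, spk_emb)
    fake_score = disc(fake_feat.detach(), spk_emb)  # stop grad into generator
    return (torch.relu(1.0 - real_score) + torch.relu(1.0 + fake_score)).mean()

# Toy usage with random tensors standing in for pooled features/embeddings.
B = 4
disc = SpeakerConsistencyDiscriminator()
real = torch.randn(B, 80)   # features of real speech from the query speaker
fake = torch.randn(B, 80)   # features of TTS output conditioned on spk_emb
spk = torch.randn(B, 256)   # query speaker's embedding vector
loss = ascl_discriminator_loss(disc, real, fake, spk)
```

In the training loop sketched by the abstract, the generated sample for the query speaker would come from synthesizing text while conditioning on an embedding extracted from external untranscribed speech, and the generator would be updated with the opposing adversarial objective.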
