Paper Title
Few-Shot Cross-Lingual TTS Using Transferable Phoneme Embedding
Paper Authors
Paper Abstract
This paper studies a transferable phoneme embedding framework that aims to deal with the cross-lingual text-to-speech (TTS) problem under the few-shot setting. Transfer learning is a common approach to few-shot learning, since training from scratch on few-shot data is bound to overfit. Still, we find that the naive transfer learning approach fails to adapt to unseen languages under extremely few-shot settings, where less than 8 minutes of data is provided. We address the problem by proposing a framework that consists of a phoneme-based TTS model and a codebook module that projects phonemes from different languages into a learned latent space. Furthermore, by utilizing phoneme-level averaged self-supervised learned features, we effectively improve the quality of the synthesized speech. Experiments show that 4 utterances, which amount to about 30 seconds of data, are enough to synthesize intelligible speech when adapting to an unseen language using our framework.
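The abstract mentions two mechanisms: a codebook module that projects phonemes from different languages into a shared latent space, and phoneme-level averaging of self-supervised features. A minimal sketch of both ideas is below; it is an illustration under assumed shapes and names (`codebook_project`, `phoneme_level_average` are hypothetical), not the paper's actual implementation, which the abstract does not specify.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def codebook_project(query, codebook):
    """Project a language-specific phoneme embedding into a shared
    latent space by soft attention over learned codebook entries.

    query:    (d,) phoneme embedding from any language
    codebook: (K, d) language-agnostic learned codebook
    returns:  (d,) projected embedding, a convex combination of entries
    """
    d = query.shape[-1]
    scores = query @ codebook.T / np.sqrt(d)   # (K,) similarity scores
    weights = softmax(scores)                  # attention weights, sum to 1
    return weights @ codebook                  # weighted mix of codebook rows

def phoneme_level_average(frame_feats, boundaries):
    """Collapse frame-level self-supervised features to one vector per
    phoneme by averaging frames inside each phoneme's segment.

    frame_feats: (T, d) frame-level features (e.g. from an SSL model)
    boundaries:  list of (start, end) frame indices per phoneme
    returns:     (num_phonemes, d) phoneme-level features
    """
    return np.stack([frame_feats[s:e].mean(axis=0) for s, e in boundaries])
```

For example, with a codebook of 8 entries in a 4-dimensional space, any phoneme embedding, regardless of source language, is mapped to a point inside the convex hull of the codebook, which is what makes the space transferable to an unseen language.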