Cross-Speaker Emotion Transfer for Low-Resource Text-to-Speech Using Non-Parallel Voice Conversion with Pitch-Shift Data Augmentation

Paper title

Cross-Speaker Emotion Transfer for Low-Resource Text-to-Speech Using Non-Parallel Voice Conversion with Pitch-Shift Data Augmentation

Authors

Terashima, Ryo; Yamamoto, Ryuichi; Song, Eunwoo; Shirahata, Yuma; Yoon, Hyun-Wook; Kim, Jae-Min; Tachibana, Kentaro

Abstract

Data augmentation via voice conversion (VC) has been successfully applied to low-resource expressive text-to-speech (TTS) when only neutral data for the target speaker are available. Although the quality of VC is crucial for this approach, it is challenging to learn a stable VC model because the amount of data is limited in low-resource scenarios, and highly expressive speech has large acoustic variety. To address this issue, we propose a novel data augmentation method that combines pitch-shifting and VC techniques. Because pitch-shift data augmentation enables the coverage of a variety of pitch dynamics, it greatly stabilizes training for both VC and TTS models, even when only 1,000 utterances of the target speaker's neutral data are available. Subjective test results showed that a FastSpeech 2-based emotional TTS system with the proposed method improved naturalness and emotional similarity compared with conventional methods.
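
As a rough illustration of how pitch shifting can widen the range of pitch dynamics in a small dataset, the sketch below scales a fundamental-frequency (F0) contour by a set of semitone offsets. It is not the authors' implementation: the function names, the example contour, and the shift range of -3 to +3 semitones are all hypothetical, and the abstract does not say how the pitch shift is actually applied.

```python
def shift_f0(f0_hz, semitones):
    """Scale voiced F0 values by 2**(semitones / 12).

    Frames with F0 == 0.0 are treated as unvoiced and left unchanged.
    """
    ratio = 2.0 ** (semitones / 12.0)
    return [f * ratio if f > 0.0 else 0.0 for f in f0_hz]


def augment_f0(f0_hz, shifts=(-3, -2, -1, 1, 2, 3)):
    """Return a dict mapping each semitone shift to a shifted copy of the contour.

    The default shift range is an assumption made for this example only.
    """
    return {s: shift_f0(f0_hz, s) for s in shifts}


# Hypothetical F0 contour in Hz; 0.0 marks unvoiced frames.
contour = [0.0, 110.0, 123.5, 0.0, 98.0]
augmented = augment_f0(contour)
```

Each shifted contour would then be paired with a resynthesized waveform, so that one neutral utterance yields several training examples with different pitch ranges.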
