Paper Title


Cross-speaker Emotion Transfer Based On Prosody Compensation for End-to-End Speech Synthesis

Authors

Tao Li, Xinsheng Wang, Qicong Xie, Zhichao Wang, Mingqi Jiang, Lei Xie

Abstract


Cross-speaker emotion transfer speech synthesis aims to synthesize emotional speech for a target speaker by transferring the emotion from reference speech recorded by another (source) speaker. In this task, extracting a speaker-independent emotion embedding from the reference speech plays an important role. However, the emotional information conveyed by such an emotion embedding tends to be weakened in the process of squeezing out the source speaker's timbre information. In response to this problem, a prosody compensation module (PCM) is proposed in this paper to compensate for the emotional information loss. Specifically, the PCM tries to obtain speaker-independent emotional information from the intermediate features of a pre-trained ASR model. To this end, a prosody compensation encoder with global context (GC) blocks is introduced to obtain global emotional information from the ASR model's intermediate features. Experiments demonstrate that the proposed PCM effectively compensates the emotion embedding for the emotional information loss while maintaining the timbre of the target speaker. Comparisons with state-of-the-art models show that the proposed method offers a clear advantage on the cross-speaker emotion transfer task.
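The abstract does not specify the exact architecture of the prosody compensation encoder. As an illustration only, the sketch below shows a minimal global context (GC) block in the style of GCNet, applied to a sequence of frame-level ASR intermediate features: attention weights over frames produce a single global context vector, which is passed through a bottleneck transform and added back to every frame. All weight shapes, the bottleneck ratio `r`, and the omission of layer normalization are simplifying assumptions, not details from the paper.

```python
import numpy as np

def softmax(x, axis=0):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_context_block(x, w_k, w_v1, w_v2):
    """Minimal GC-block sketch (hypothetical shapes, not the paper's exact design).

    x    : (T, C) frame-level ASR intermediate features
    w_k  : (C, 1) weights producing per-frame attention logits
    w_v1 : (C, C//r) and w_v2 : (C//r, C) bottleneck transform weights
    Returns (T, C): input fused with a broadcast global context vector.
    """
    attn = softmax(x @ w_k, axis=0)       # (T, 1) attention over all frames
    context = (attn * x).sum(axis=0)      # (C,) global context vector
    z = np.maximum(context @ w_v1, 0.0)   # bottleneck projection + ReLU
    z = z @ w_v2                          # back to (C,)
    return x + z                          # same context added to every frame
```

Because the context vector is pooled over the whole utterance before being broadcast back, each output frame receives the same global (utterance-level) information, which matches the abstract's goal of capturing global emotional information rather than frame-local detail.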
