Paper Title

Multi-Target Emotional Voice Conversion With Neural Vocoders

Authors

Songxiang Liu, Yuewen Cao, Helen Meng

Abstract

Emotional voice conversion (EVC) is one way to generate expressive synthetic speech. Previous approaches mainly focused on modeling one-to-one mapping, i.e., conversion from one emotional state to another emotional state, with Mel-cepstral vocoders. In this paper, we investigate building a multi-target EVC (MTEVC) architecture, which combines a deep bidirectional long-short term memory (DBLSTM)-based conversion model and a neural vocoder. Phonetic posteriorgrams (PPGs) containing rich linguistic information are incorporated into the conversion model as auxiliary input features, which boost the conversion performance. To leverage the advantages of the newly emerged neural vocoders, we investigate the conditional WaveNet and flow-based WaveNet (FloWaveNet) as speech generators. The vocoders take in additional speaker information and emotion information as auxiliary features and are trained with a multi-speaker and multi-emotion speech corpus. Objective metrics and subjective evaluation of the experimental results verify the efficacy of the proposed MTEVC architecture for EVC.
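The abstract describes a DBLSTM-based conversion model that takes phonetic posteriorgrams (PPGs) as auxiliary input features alongside the source acoustic features. A minimal sketch of such a model in PyTorch is shown below; all dimensions (80-dim acoustic features, 40-dim PPGs), layer counts, and hidden sizes are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn


class DBLSTMConversionModel(nn.Module):
    """Sketch of a DBLSTM-based conversion model: source acoustic
    features are concatenated frame-by-frame with PPGs (the auxiliary
    linguistic features) and mapped to target acoustic features.
    All dimensions here are assumptions for illustration."""

    def __init__(self, acoustic_dim=80, ppg_dim=40, hidden=256, layers=2):
        super().__init__()
        # Bidirectional stacked LSTM over the concatenated features.
        self.blstm = nn.LSTM(
            acoustic_dim + ppg_dim, hidden,
            num_layers=layers, batch_first=True, bidirectional=True,
        )
        # Project the 2*hidden bidirectional output back to acoustic_dim.
        self.proj = nn.Linear(2 * hidden, acoustic_dim)

    def forward(self, acoustic, ppg):
        # acoustic: (batch, frames, acoustic_dim); ppg: (batch, frames, ppg_dim)
        x = torch.cat([acoustic, ppg], dim=-1)  # frame-level concatenation
        out, _ = self.blstm(x)
        return self.proj(out)


model = DBLSTMConversionModel()
src = torch.randn(4, 100, 80)   # source acoustic features
ppg = torch.randn(4, 100, 40)   # phonetic posteriorgrams
converted = model(src, ppg)
print(converted.shape)          # same frame count, acoustic_dim outputs
```

In the paper's full MTEVC architecture, the converted features would then condition a neural vocoder (conditional WaveNet or FloWaveNet) that additionally receives speaker and emotion embeddings; that stage is omitted here.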
