Paper Title

Multi-Speaker Expressive Speech Synthesis via Multiple Factors Decoupling

Paper Authors

Xinfa Zhu, Yi Lei, Kun Song, Yongmao Zhang, Tao Li, Lei Xie

Paper Abstract

This paper aims to synthesize the target speaker's speech with the desired speaking style and emotion by transferring the style and emotion from reference speech recorded by other speakers. We address this challenging problem with a two-stage framework composed of a text-to-style-and-emotion (Text2SE) module and a style-and-emotion-to-wave (SE2Wave) module, which are bridged by neural bottleneck (BN) features. To further solve the multi-factor (speaker timbre, speaking style, and emotion) decoupling problem, we adopt multi-label binary vectors (MBV) and mutual information (MI) minimization to respectively discretize the extracted embeddings and disentangle these highly entangled factors in both the Text2SE and SE2Wave modules. Moreover, we introduce a semi-supervised training strategy to leverage data from multiple speakers, including emotion-labeled data, style-labeled data, and unlabeled data. To better transfer fine-grained expression from the references to the target speaker in non-parallel transfer, we introduce a reference-candidate pool and propose an attention-based reference selection approach. Extensive experiments demonstrate the effectiveness of our model design.
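The abstract mentions discretizing the extracted style and emotion embeddings into multi-label binary vectors (MBV). Below is a minimal PyTorch-style sketch of one common way such a discretization can be realized, via a sigmoid projection with a straight-through estimator; the class name `MBVQuantizer`, the dimensions, and the binarization scheme are illustrative assumptions, not the authors' reference implementation.

```python
# A minimal sketch of multi-label binary vector (MBV) discretization.
# Assumption: sigmoid + straight-through binarization; the paper does not
# specify this exact scheme.
import torch
import torch.nn as nn


class MBVQuantizer(nn.Module):
    """Project a continuous embedding to a multi-label binary vector."""

    def __init__(self, in_dim: int, num_bits: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, num_bits)

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        logits = self.proj(embedding)          # (batch, num_bits)
        probs = torch.sigmoid(logits)          # soft per-bit membership
        hard = (probs > 0.5).float()           # hard 0/1 codes in the forward pass
        # Straight-through estimator: the forward value equals `hard`,
        # while gradients flow through `probs`.
        return hard + probs - probs.detach()


# Usage: discretize a hypothetical 256-dim style embedding into 32 binary labels.
quantizer = MBVQuantizer(in_dim=256, num_bits=32)
style_embedding = torch.randn(4, 256)          # batch of 4 utterance-level embeddings
style_code = quantizer(style_embedding)        # values in {0, 1}, differentiable
print(style_code.shape)                        # torch.Size([4, 32])
```

The straight-through trick keeps hard 0/1 codes in the forward pass while letting gradients flow through the soft sigmoid probabilities, which allows a discrete bottleneck of this kind to be trained end to end.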
