Paper Title
Unsupervised Style and Content Separation by Minimizing Mutual Information for Speech Synthesis
Paper Authors
Paper Abstract
We present a method to generate speech from input text and a style vector that is extracted from a reference speech signal in an unsupervised manner, i.e., no style annotation, such as speaker information, is required. Existing unsupervised methods, during training, generate speech by computing style from the corresponding ground truth sample and use a decoder to combine the style vector with the input text. Training the model in such a way leaks content information into the style vector. The decoder can use the leaked content and ignore some of the input text to minimize the reconstruction loss. At inference time, when the reference speech does not match the content input, the output may not contain all of the content of the input text. We refer to this problem as "content leakage", which we address by explicitly estimating and minimizing the mutual information between the style and the content through an adversarial training formulation. We call our method MIST - Mutual Information based Style Content Separation. The main goal of the method is to preserve the input content in the synthesized speech signal, which we measure by the word error rate (WER); we show substantial improvements over state-of-the-art unsupervised speech synthesis methods.
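To make the adversarial formulation concrete, below is a minimal sketch of the kind of mutual-information minimization the abstract describes. The abstract only states that MI between style and content is estimated and minimized adversarially; the use of PyTorch, a MINE-style statistics network (Belghazi et al., 2018) as the estimator, and every module name and dimension here are illustrative assumptions, not the authors' implementation.

# A minimal sketch, assuming PyTorch and a MINE-style MI estimator; all
# names and dimensions are hypothetical, not the authors' released code.
import math
import torch
import torch.nn as nn

class MINEEstimator(nn.Module):
    # Statistics network T(style, content) whose output forms a
    # Donsker-Varadhan lower bound on I(style; content).
    def __init__(self, style_dim, content_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(style_dim + content_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, style, content):
        return self.net(torch.cat([style, content], dim=-1))

def mi_lower_bound(estimator, style, content):
    # E[T(s, c)] - log E[exp(T(s, c'))], where c' is drawn from the
    # product of marginals via an in-batch shuffle of the content.
    joint = estimator(style, content).mean()
    shuffled = content[torch.randperm(content.size(0))]
    marginal = torch.logsumexp(estimator(style, shuffled), dim=0) - math.log(content.size(0))
    return (joint - marginal).squeeze()

if __name__ == "__main__":
    batch, style_dim, content_dim = 32, 64, 256
    estimator = MINEEstimator(style_dim, content_dim)
    est_opt = torch.optim.Adam(estimator.parameters(), lr=1e-4)

    # Stand-ins for the style-encoder and text-encoder outputs.
    style = torch.randn(batch, style_dim, requires_grad=True)
    content = torch.randn(batch, content_dim)

    # Adversarial step 1: tighten the bound (maximize the MI estimate)
    # w.r.t. the estimator only; the encoder outputs are detached.
    est_opt.zero_grad()
    (-mi_lower_bound(estimator, style.detach(), content)).backward()
    est_opt.step()

    # Adversarial step 2: the synthesizer would minimize
    # reconstruction_loss + lambda * MI estimate w.r.t. its encoders,
    # discouraging content information from leaking into the style vector.
    mi = mi_lower_bound(estimator, style, content)
    mi.backward()  # gradients now flow back into the style vector
    print(f"estimated MI lower bound: {mi.item():.4f}")

In a full training loop, step 2 would add the MI term to the reconstruction loss of the speech decoder, so the style encoder is pushed toward vectors that carry no information about the input text, which is the stated remedy for content leakage.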