Paper title
TTS-by-TTS 2: Data-selective augmentation for neural speech synthesis using ranking support vector machine with variational autoencoder
Authors
Abstract
Recent advances in synthetic speech quality have made it possible to train text-to-speech (TTS) systems on synthetic corpora. However, merely increasing the amount of synthetic data is not always advantageous for improving training efficiency. Our aim in this study is to select the synthetic data that are beneficial to the training process. In the proposed method, we first adopt a variational autoencoder whose posterior distribution is used to extract latent features representing the acoustic similarity between the recorded and synthetic corpora. Using those learned features, we then train a ranking support vector machine (RankSVM), which is well known for effectively ranking relative attributes among binary classes. By setting the recorded and synthetic corpora as two opposite classes, RankSVM is used to determine how acoustically similar the synthesized speech is to the recorded data. Synthetic TTS data whose distribution is close to that of the recorded data are then selected from large-scale synthetic corpora. Retraining the TTS model with these data can significantly improve the quality of the synthesized speech. Objective and subjective evaluation results show the superiority of the proposed method over conventional methods.
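
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the latent features have already been extracted (here they are hand-written two-dimensional toy vectors rather than VAE posterior outputs), it uses a plain linear RankSVM trained by subgradient descent on the pairwise hinge loss, and the function names `train_rank_svm` and `select_synthetic`, along with all hyperparameters, are hypothetical.

```python
import random


def dot(u, v):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def train_rank_svm(recorded, synthetic, epochs=50, lr=0.1, reg=0.01, seed=0):
    """Fit a linear ranking function w so that recorded features score
    higher than synthetic ones (pairwise hinge loss, subgradient descent).
    Hyperparameters are illustrative assumptions, not values from the paper."""
    rng = random.Random(seed)
    w = [0.0] * len(recorded[0])
    # Every (recorded, synthetic) pair is one ranking constraint.
    pairs = [(r, s) for r in recorded for s in synthetic]
    for _ in range(epochs):
        rng.shuffle(pairs)
        for r, s in pairs:
            diff = [a - b for a, b in zip(r, s)]
            # L2 regularization shrinks w on every step.
            w = [wi * (1.0 - lr * reg) for wi in w]
            # Update only when the margin w.(r - s) >= 1 is violated.
            if dot(w, diff) < 1.0:
                w = [wi + lr * di for wi, di in zip(w, diff)]
    return w


def select_synthetic(w, synthetic, keep):
    """Return indices of the `keep` synthetic items with the highest rank
    score, i.e. those ranked closest to the recorded class."""
    order = sorted(range(len(synthetic)),
                   key=lambda i: dot(w, synthetic[i]),
                   reverse=True)
    return order[:keep]


# Toy latent features (assumed placeholders for VAE-derived features).
recorded = [[1.0, 0.0], [0.9, 0.1], [1.1, -0.1]]
synthetic = [[0.8, 0.0], [0.0, 0.1], [-0.5, 0.0], [0.6, 0.1]]

w = train_rank_svm(recorded, synthetic)
selected = select_synthetic(w, synthetic, keep=2)
print("selected synthetic indices:", sorted(selected))
```

In this toy setup the two synthetic vectors lying nearest the recorded cluster receive the highest scores and are kept; in practice the retained subset would then be used to retrain the TTS model, as the abstract describes.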