Evaluating and reducing the distance between synthetic and real speech distributions

Paper title

Evaluating and reducing the distance between synthetic and real speech distributions

Authors

Christoph Minixhofer, Ondřej Klejch, Peter Bell

Abstract

While modern Text-to-Speech (TTS) systems can produce natural-sounding speech, they remain unable to reproduce the full diversity found in natural speech data. We consider the distribution of all possible real speech samples that could be generated by these speakers alongside the distribution of all synthetic samples that could be generated for the same set of speakers, using a particular TTS system. We set out to quantify the distance between real and synthetic speech via a range of utterance-level statistics related to properties of the speaker, speech prosody and acoustic environment. Differences in the distribution of these statistics are evaluated using the Wasserstein distance. We reduce these distances by providing ground-truth values at generation time, and quantify the improvements to the overall distribution distance, approximated using an automatic speech recognition system. Our best system achieves a 10% reduction in distribution distance.
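
As a rough illustration of how two sets of utterance-level statistics might be compared, the sketch below computes a one-dimensional Wasserstein distance between two empirical samples. It is not the authors' implementation: the abstract does not state the order of the distance or the estimator used, so the choice of the 1-Wasserstein distance, the function name wasserstein_1d, and the toy pitch values are all assumptions made for this example.

```python
def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between two empirical 1-D samples.

    Computed as the integral of |F_x(t) - F_y(t)| over t, where F_x and
    F_y are the empirical CDFs. Sample sizes may differ.
    """
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(xs + ys)  # every location where either CDF can jump
    total = 0.0
    i = j = 0  # counts of xs / ys values that are <= the current point
    for k in range(len(points) - 1):
        t = points[k]
        while i < len(xs) and xs[i] <= t:
            i += 1
        while j < len(ys) and ys[j] <= t:
            j += 1
        # CDF gap on [t, next point) times the width of that interval
        total += abs(i / len(xs) - j / len(ys)) * (points[k + 1] - t)
    return total


# Hypothetical per-utterance mean pitch values in Hz (made-up toy numbers).
real_pitch = [110.0, 145.0, 180.0, 220.0, 260.0]
synthetic_pitch = [150.0, 160.0, 170.0, 175.0, 185.0]

distance = wasserstein_1d(real_pitch, synthetic_pitch)
print(f"Wasserstein distance between pitch distributions: {distance:.2f} Hz")
```

In this toy setup the synthetic values cluster more tightly than the real ones, so the distance is non-zero even though the two samples have similar means, which is the kind of reduced diversity the abstract describes.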
