Paper Title
SQuId: Measuring Speech Naturalness in Many Languages
Paper Authors
Paper Abstract
Much of text-to-speech research relies on human evaluation, which incurs heavy costs and slows down the development process. The problem is particularly acute in heavily multilingual applications, where recruiting and polling judges can take weeks. We introduce SQuId (Speech Quality Identification), a multilingual naturalness prediction model trained on over a million ratings and tested in 65 locales, the largest effort of this type to date. The main insight is that training one model on many locales consistently outperforms mono-locale baselines. We present our task, the model, and show that it outperforms a competitive baseline based on w2v-BERT and VoiceMOS by 50.0%. We then demonstrate the effectiveness of cross-locale transfer during fine-tuning and highlight its effect on zero-shot locales, i.e., locales for which there is no fine-tuning data. Through a series of analyses, we highlight the role of non-linguistic effects such as sound artifacts in cross-locale transfer. Finally, we present the effect of our design decisions, e.g., model size, pre-training diversity, and language rebalancing, with several ablation experiments.
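As a rough illustration of what a naturalness prediction model of this kind might look like, the sketch below stacks a small regression head on top of a speech encoder and fits the resulting scalar score to human ratings with a squared-error loss. Everything here is a minimal PyTorch sketch: the stand-in encoder, the head dimensions, the mean-pooling, and the `NaturalnessPredictor` name are placeholders chosen for the example, not the paper's actual SQuId architecture (which builds on w2v-BERT and is trained on over a million ratings across many locales).

```python
import torch
import torch.nn as nn

class NaturalnessPredictor(nn.Module):
    """Illustrative sketch: a speech encoder followed by a small
    regression head that maps an utterance to a scalar naturalness score.
    The encoder is a stand-in operating on precomputed frame features."""

    def __init__(self, encoder: nn.Module, hidden_dim: int):
        super().__init__()
        self.encoder = encoder            # placeholder for a pretrained speech encoder
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),            # scalar MOS-like rating
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, time, hidden_dim) frame-level representations
        frames = self.encoder(features)
        pooled = frames.mean(dim=1)       # mean-pool over time to one vector per utterance
        return self.head(pooled).squeeze(-1)

# Toy usage with an identity encoder, purely for shape checking.
model = NaturalnessPredictor(encoder=nn.Identity(), hidden_dim=1024)
dummy = torch.randn(2, 300, 1024)         # 2 utterances, 300 frames, 1024-dim features
scores = model(dummy)                     # -> tensor of shape (2,)

# Hypothetical human ratings; training would regress predicted scores onto them.
targets = torch.tensor([3.5, 4.2])
loss = nn.functional.mse_loss(scores, targets)
```

The key design point the abstract highlights is not the head itself but the data regime: one such model fine-tuned jointly on ratings from many locales, rather than one model per locale, which is what enables transfer to zero-shot locales with no fine-tuning data.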