MnTTS: An Open-Source Mongolian Text-to-Speech Synthesis Dataset and Accompanied Baseline

Paper Title

MnTTS: An Open-Source Mongolian Text-to-Speech Synthesis Dataset and Accompanied Baseline

Authors

Yifan Hu, Pengkai Yin, Rui Liu, Feilong Bao, Guanglai Gao

Abstract

This paper introduces a high-quality open-source text-to-speech (TTS) synthesis dataset for Mongolian, a low-resource language spoken by over 10 million people worldwide. The dataset, named MnTTS, consists of about 8 hours of transcribed audio recordings spoken by a 22-year-old professional female Mongolian announcer. It is the first publicly available dataset developed to promote Mongolian TTS applications in both academia and industry. In this paper, we share our experience by describing the dataset development procedures and the challenges we faced. To demonstrate the reliability of our dataset, we built a powerful non-autoregressive baseline system based on the FastSpeech2 model and the HiFi-GAN vocoder, and evaluated it using the subjective mean opinion score (MOS) and real-time factor (RTF) metrics. Evaluation results show that the baseline system trained on our dataset achieves a MOS above 4 and an RTF of about $3.30 \times 10^{-1}$, which makes it applicable for practical use. The dataset, training recipe, and pretrained TTS models are freely available at https://github.com/walker-hyf/MnTTS.
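
As a rough illustration of the two metrics named in the abstract, the sketch below computes RTF as synthesis time divided by the duration of the generated audio, and MOS as the arithmetic mean of listener ratings on a 1 to 5 scale. The function names and the sample numbers are hypothetical and chosen only for illustration; they are not taken from the MnTTS training recipe or its evaluation data.

```python
from statistics import mean


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return synthesis time divided by the duration of the generated audio.

    Values below 1.0 mean the system synthesizes faster than real time.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def mean_opinion_score(ratings: list[float]) -> float:
    """Return the arithmetic mean of listener ratings on a 1-5 scale."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("each rating must lie between 1 and 5")
    return mean(ratings)


# Hypothetical numbers: 3.3 s of compute to produce 10 s of speech.
rtf = real_time_factor(3.3, 10.0)

# Hypothetical ratings from four listeners for one utterance.
mos = mean_opinion_score([4, 5, 4, 4])

print(f"RTF = {rtf:.2f}, MOS = {mos:.2f}")
```

Under this definition, an RTF of about 0.33 means that one second of speech takes roughly a third of a second to synthesize on the measured hardware.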
