Paper Title


Common Phone: A Multilingual Dataset for Robust Acoustic Modelling

Authors

Philipp Klumpp, Tomás Arias-Vergara, Paula Andrea Pérez-Toro, Elmar Nöth, Juan Rafael Orozco-Arroyave

Abstract


Current state-of-the-art acoustic models can easily comprise more than 100 million parameters. This growing complexity demands larger training datasets to maintain decent generalization of the final decision function. An ideal dataset is not necessarily large in size, but large with respect to the number of unique speakers, utilized hardware and varying recording conditions. This enables a machine learning model to explore as much of the domain-specific input space as possible during parameter estimation. This work introduces Common Phone, a gender-balanced, multilingual corpus recorded from more than 11,000 contributors via Mozilla's Common Voice project. It comprises around 116 hours of speech enriched with automatically generated phonetic segmentation. A Wav2Vec 2.0 acoustic model was trained on Common Phone to perform phonetic symbol recognition and validate the quality of the generated phonetic annotation. The architecture achieved a PER of 18.1% on the entire test set, computed over all 101 unique phonetic symbols, showing slight differences between the individual languages. We conclude that Common Phone provides sufficient variability and reliable phonetic annotation to help bridge the gap between research and application of acoustic models.
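The PER (phone error rate) reported above is the edit distance between the predicted and reference phone sequences, normalized by the reference length. A minimal sketch of this metric, using only the standard library (function names are illustrative, not from the paper's codebase):

```python
# Hypothetical sketch of phone error rate (PER) computation.
# PER = Levenshtein (edit) distance between hypothesis and reference
# phone sequences, divided by the number of reference phones.

def edit_distance(ref, hyp):
    """Levenshtein distance between two symbol sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match)
            ))
        prev = curr
    return prev[-1]

def phone_error_rate(ref, hyp):
    """Fraction of reference phones that were mispredicted."""
    return edit_distance(ref, hyp) / len(ref)

# Example with IPA-like phone symbols: one substitution over four phones.
ref = ["h", "ə", "l", "oʊ"]
hyp = ["h", "e", "l", "oʊ"]
print(phone_error_rate(ref, hyp))  # → 0.25
```

A PER of 18.1% thus means that, on average, roughly one in five to six reference phone symbols required an edit (insertion, deletion, or substitution) to match the model's prediction.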
