Paper title
SpeechMatrix: A Large-Scale Mined Corpus of Multilingual Speech-to-Speech Translations
Paper authors
Paper abstract
We present SpeechMatrix, a large-scale multilingual corpus of speech-to-speech translations mined from real speech in European Parliament recordings. It contains speech alignments in 136 language pairs with a total of 418 thousand hours of speech. To evaluate the quality of this parallel speech, we train bilingual speech-to-speech translation models on mined data only and establish extensive baseline results on the EuroParl-ST, VoxPopuli and FLEURS test sets. Enabled by the multilinguality of SpeechMatrix, we also explore multilingual speech-to-speech translation, a topic that few other works have addressed. We also demonstrate that model pre-training and sparse scaling using Mixture-of-Experts bring large gains to translation performance. The mined data and models are freely available.