Paper Title
Swiss Parliaments Corpus, an Automatically Aligned Swiss German Speech to Standard German Text Corpus
Paper Authors
Paper Abstract
We present the Swiss Parliaments Corpus (SPC), an automatically aligned Swiss German speech to Standard German text corpus. This first version of the corpus is based on publicly available data of the Bernese cantonal parliament and consists of 293 hours of data. It was created using a novel forced sentence alignment procedure and an alignment quality estimator, which can be used to trade off corpus size and quality. We trained Automatic Speech Recognition (ASR) models as baselines on different subsets of the data and achieved a Word Error Rate (WER) of 0.278 and a BLEU score of 0.586 on the SPC test set. The corpus is freely available for download.
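To make the reported metric concrete, the following is a minimal sketch of how a Word Error Rate is commonly computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and an ASR hypothesis, divided by the number of reference words. The function name `word_error_rate`, the whitespace tokenization, and the example sentences are assumptions made for illustration; the abstract does not describe the authors' evaluation tooling or text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenization is a plain whitespace split, which is
    an assumption and not necessarily what the paper's evaluation used.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical sentences, not taken from the corpus.
    print(word_error_rate("das ist ein test", "das ist test"))
```

Under this definition, a WER of 0.278 corresponds to roughly 28 word-level errors per 100 reference words.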