Paper title
Lahjoita puhetta -- a large-scale corpus of spoken Finnish with some benchmarks
Authors
Abstract
The Donate Speech campaign has so far succeeded in gathering approximately 3600 hours of ordinary, colloquial Finnish speech into the Lahjoita puhetta (Donate Speech) corpus. The corpus includes over twenty thousand speakers from all the regions of Finland and from all age brackets. The primary goals of the collection were to create a representative, large-scale resource to study spontaneous spoken Finnish and to accelerate the development of language technology and speech-based services. In this paper, we present the collection process and the collected corpus, and showcase its versatility through multiple use cases. The evaluated use cases include automatic speech recognition of spontaneous speech; detection of age, gender, dialect and topic; and metadata analysis. We provide benchmarks for the use cases, as well as downloadable, trained baseline systems with open-source code for reproducibility. One further use case is to verify the metadata and transcripts given in this corpus itself, and to suggest artificial metadata and transcripts for the part of the corpus where they are missing.