Paper Title
The NTNU System at the Interspeech 2020 Non-Native Children's Speech ASR Challenge
Authors
Abstract
This paper describes the NTNU ASR system that participated in the Interspeech 2020 Non-Native Children's Speech ASR Challenge, supported by the SIG-CHILD group of ISCA. This ASR shared task is made considerably more challenging by the coexisting diversity of non-native and children's speaking characteristics. In the closed-track evaluation, all participants were restricted to developing their systems solely on the speech and text corpora provided by the organizer. To work around this under-resourced setting, we built our ASR system on top of CNN-TDNNF-based acoustic models while harnessing the combined effect of several data augmentation strategies, including both utterance- and word-level speed perturbation and spectrogram augmentation, alongside a simple yet effective data-cleansing approach. All variants of our ASR system employed an RNN-based language model, trained solely on the text dataset released by the organizer, to rescore the first-pass recognition hypotheses. Our system with the best configuration came in second place with a word error rate (WER) of 17.59%, while the WERs of the top-performing, second runner-up, and official baseline systems were 15.67%, 18.71%, and 35.09%, respectively.
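For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how WER is commonly computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function names `word_error_rate` and `relative_reduction` are hypothetical and chosen for illustration only; this is not the challenge's official scoring tool, and text normalization rules used by the organizer are not modeled here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (an assumption, not the official scorer):
    words are obtained by whitespace splitting with no normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, system_wer: float) -> float:
    """Relative WER reduction of a system with respect to a baseline."""
    return (baseline_wer - system_wer) / baseline_wer


if __name__ == "__main__":
    # Toy example: one substitution ("sat" -> "sit") and one deletion ("the").
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # WER figures taken from the abstract: official baseline vs. the NTNU system.
    print(relative_reduction(35.09, 17.59))
```

Applied to the figures reported in the abstract, `relative_reduction(35.09, 17.59)` gives the relative improvement of the NTNU system over the official baseline, which works out to roughly half of the baseline's errors.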