The NTNU System at the Interspeech 2020 Non-Native Children's Speech ASR Challenge

Paper Title

The NTNU System at the Interspeech 2020 Non-Native Children's Speech ASR Challenge

Authors

Lo, Tien-Hong, Chao, Fu-An, Weng, Shi-Yan, Chen, Berlin

Abstract

This paper describes the NTNU ASR system that participated in the Interspeech 2020 Non-Native Children's Speech ASR Challenge, supported by the SIG-CHILD group of ISCA. The shared task is especially challenging because non-native and children's speaking characteristics coexist and vary widely. In the closed-track evaluation, all participants were restricted to developing their systems solely from the speech and text corpora provided by the organizer. To work around this under-resourced condition, we built our ASR system on top of CNN-TDNNF-based acoustic models while combining several data augmentation strategies, including utterance- and word-level speed perturbation and spectrogram augmentation, alongside a simple yet effective data-cleansing approach. All variants of our ASR system employed an RNN-based language model, trained solely on the text dataset released by the organizer, to rescore the first-pass recognition hypotheses. Our system with the best configuration placed second with a word error rate (WER) of 17.59%, while the top-performing, second runner-up, and official baseline systems obtained 15.67%, 18.71%, and 35.09%, respectively.
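For readers unfamiliar with the metric, the sketch below shows how a word error rate is commonly computed and how the figures quoted in the abstract compare. It is a minimal illustration, not the challenge's scoring tool: it assumes whitespace tokenization with no text normalization, and the function names `word_error_rate` and `relative_reduction` are hypothetical. With the reported numbers, 17.59% against the 35.09% baseline works out to a relative WER reduction of about 49.9%.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words.

    Assumption: words are split on whitespace and compared exactly;
    official scoring tools may normalize text differently.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, system: float) -> float:
    """Fraction by which `system` lowers the error rate of `baseline`."""
    return (baseline - system) / baseline


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # WER figures taken from the abstract: official baseline vs. the NTNU system.
    print(relative_reduction(35.09, 17.59))
```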
