Paper Title
Distilling Knowledge from Ensembles of Acoustic Models for Joint CTC-Attention End-to-End Speech Recognition
Paper Authors
Paper Abstract
Knowledge distillation has been widely used to compress existing deep learning models while preserving performance across a wide range of applications. In the specific context of Automatic Speech Recognition (ASR), distillation from ensembles of acoustic models has recently shown promising results for improving recognition performance. In this paper, we propose an extension of multi-teacher distillation methods to joint CTC-attention end-to-end ASR systems. We also introduce three novel distillation strategies. The core intuition behind them is to integrate the error rate metric into the teacher selection rather than relying solely on the observed losses. In this way, we directly distill and optimize the student toward the metric that matters for speech recognition. We evaluate these strategies under a selection of training procedures on different datasets (TIMIT, LibriSpeech, Common Voice) and various languages (English, French, Italian). In particular, state-of-the-art error rates are reported on the Common Voice French, Common Voice Italian, and TIMIT datasets.
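To make the error-rate-based teacher selection described in the abstract concrete, below is a minimal sketch, not the authors' implementation. It assumes the `jiwer` package for WER computation and PyTorch for the soft-target distillation term; the helper names (`select_best_teacher`, `distillation_loss`) and the per-utterance selection granularity are illustrative assumptions.

```python
# Minimal sketch: pick the teacher with the lowest WER on each utterance
# (rather than the lowest loss), then distill the student toward that
# teacher's output distribution. Names and granularity are hypothetical.
import torch.nn.functional as F
import jiwer  # assumed available for word error rate computation


def select_best_teacher(teacher_hypotheses, reference):
    """Return the index of the teacher whose decoded hypothesis has the
    lowest WER against the reference transcript."""
    wers = [jiwer.wer(reference, hyp) for hyp in teacher_hypotheses]
    return min(range(len(wers)), key=wers.__getitem__)


def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence between the selected teacher's and the student's
    output distributions (soft targets); in a joint CTC-attention setup
    this term would be interpolated with the usual CTC and attention losses."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)
```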