Paper Title

Ultra Fast Speech Separation Model with Teacher Student Learning

Paper Authors

Sanyuan Chen, Yu Wu, Zhuo Chen, Jian Wu, Takuya Yoshioka, Shujie Liu, Jinyu Li, Xiangzhan Yu

Paper Abstract

Transformer has been successfully applied to speech separation recently with its strong long-dependency modeling capacity using a self-attention mechanism. However, Transformer tends to have heavy run-time costs due to the deep encoder layers, which hinders its deployment on edge devices. A small Transformer model with fewer encoder layers is preferred for computational efficiency, but it is prone to performance degradation. In this paper, an ultra fast speech separation Transformer model is proposed to achieve both better performance and efficiency with teacher student learning (T-S learning). We introduce layer-wise T-S learning and objective shifting mechanisms to guide the small student model to learn intermediate representations from the large teacher model. Compared with the small Transformer model trained from scratch, the proposed T-S learning method reduces the word error rate (WER) by more than 5% for both multi-channel and single-channel speech separation on LibriCSS dataset. Utilizing more unlabeled speech data, our ultra fast speech separation models achieve more than 10% relative WER reduction.
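To make the layer-wise T-S learning and objective shifting described in the abstract concrete, below is a minimal PyTorch-style sketch. The per-layer MSE loss, the student-to-teacher layer mapping, and the linear shifting schedule are illustrative assumptions rather than the paper's exact recipe; `student_hiddens` and `teacher_hiddens` stand for hypothetical lists of per-layer encoder outputs.

```python
import torch
import torch.nn.functional as F

def layerwise_ts_loss(student_hiddens, teacher_hiddens, layer_map):
    """Layer-wise T-S loss: match each student encoder layer to a chosen
    teacher layer (MSE here is an assumed choice). layer_map[i] is the
    teacher layer index paired with student layer i."""
    losses = [
        F.mse_loss(student_hiddens[s], teacher_hiddens[t].detach())
        for s, t in enumerate(layer_map)
    ]
    return torch.stack(losses).mean()

def objective_shifting_weight(step, total_steps):
    """Illustrative linear schedule: 0 at the start (pure distillation),
    1 at the end (pure separation objective)."""
    return min(step / total_steps, 1.0)

def total_loss(sep_loss, distill_loss, step, total_steps):
    """Objective shifting: interpolate from the T-S loss toward the
    original separation training objective as training progresses."""
    w = objective_shifting_weight(step, total_steps)
    return (1.0 - w) * distill_loss + w * sep_loss
```

In this sketch the small student starts by imitating the large teacher's intermediate representations and gradually shifts to optimizing the separation objective itself; the schedule shape and layer pairing would need to be tuned for a real system.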
