Paper Title

Fast Real-time Personalized Speech Enhancement: End-to-End Enhancement Network (E3Net) and Knowledge Distillation

Paper Authors

Manthan Thakker, Sefik Emre Eskimez, Takuya Yoshioka, Huaming Wang

Paper Abstract

This paper investigates how to improve the runtime speed of personalized speech enhancement (PSE) networks while maintaining the model quality. Our approach includes two aspects: architecture and knowledge distillation (KD). We propose an end-to-end enhancement (E3Net) model architecture, which is $3\times$ faster than a baseline STFT-based model. We also use KD techniques to develop compressed student models without significantly degrading quality. In addition, we investigate using noisy data without reference clean signals for training the student models, where we combine KD with multi-task learning (MTL) using an automatic speech recognition (ASR) loss. Our results show that E3Net provides better speech and transcription quality with a lower target speaker over-suppression (TSOS) rate than the baseline model. Furthermore, we show that the KD methods can yield student models that are $2-4\times$ faster than the teacher and provide reasonable quality. Combining KD and MTL improves the ASR and TSOS metrics without degrading the speech quality.
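The abstract names the training objective only at a high level. Purely as an illustration, the sketch below shows how a KD term (the student mimicking the teacher's enhanced output on unlabeled noisy data) might be combined with an MTL ASR term such as a CTC loss when no clean reference signal is available. The `teacher`, `student`, and `asr_model` modules, the L1 distillation loss, and the mixing weight `alpha` are all assumptions for the example, not the paper's actual formulation.

```python
# Minimal PyTorch-style sketch of a combined KD + MTL (ASR) loss.
# All module names and loss choices here are hypothetical.
import torch
import torch.nn.functional as F

def kd_mtl_loss(noisy, student, teacher, asr_model, asr_targets, alpha=0.5):
    """Combined loss: a KD term pulls the student toward the teacher's
    enhanced output; an ASR (CTC) term on the student output supplies
    extra supervision when no clean reference signal exists."""
    with torch.no_grad():
        teacher_out = teacher(noisy)          # teacher output as soft target
    student_out = student(noisy)

    # Distillation term: student mimics the teacher's enhanced waveform.
    kd_term = F.l1_loss(student_out, teacher_out)

    # MTL term: ASR loss on the student's output; log_probs is (T, N, C).
    log_probs = asr_model(student_out)
    asr_term = F.ctc_loss(
        log_probs,
        asr_targets["labels"],
        asr_targets["input_lengths"],
        asr_targets["label_lengths"],
    )
    return alpha * kd_term + (1.0 - alpha) * asr_term
```

In this sketch, `alpha` trades off fidelity to the teacher against ASR-friendliness; the abstract's finding that KD plus MTL improves ASR and TSOS metrics without hurting speech quality corresponds to adding the second term rather than relying on distillation alone.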
