Combining Spatial Clustering with LSTM Speech Models for Multichannel Speech Enhancement

Paper Title

Combining Spatial Clustering with LSTM Speech Models for Multichannel Speech Enhancement

Authors

Felix Grezes, Zhaoheng Ni, Viet Anh Trinh, Michael Mandel

Abstract

Recurrent neural networks using the LSTM architecture can achieve significant single-channel noise reduction. It is not obvious, however, how to apply them to multichannel inputs in a way that generalizes to new microphone configurations. In contrast, spatial clustering techniques can achieve such generalization, but lack a strong signal model. This paper combines the two approaches to attain both the spatial separation performance and generality of multichannel spatial clustering and the signal modeling performance of multiple parallel single-channel LSTM speech enhancers. The system is compared to several baselines on the CHiME3 dataset in terms of speech quality predicted by the PESQ algorithm and word error rate of a recognizer trained on mismatched conditions, in order to focus on generalization. Our experiments show that combining the LSTM models with spatial clustering reduces word error rate by 4.6% absolute (17.2% relative) on the development set and 11.2% absolute (25.5% relative) on the test set compared with the spatial clustering system, and by 10.75% absolute (32.72% relative) on the development set and 6.12% absolute (15.76% relative) on the test set compared with the LSTM model.
