Paper Title
Integration of variational autoencoder and spatial clustering for adaptive multi-channel neural speech separation
Paper Authors
Paper Abstract
In this paper, we propose a method combining a variational autoencoder model of speech with a spatial clustering approach for multi-channel speech separation. The advantage of integrating spatial clustering with a spectral model has been shown in several works. As the spectral model, previous works used either factorial generative models of the mixed speech or discriminative neural networks. In our work, we combine the strengths of both approaches by building a factorial model based on a generative neural network, a variational autoencoder. By doing so, we can exploit the modeling power of neural networks while keeping a structured model. Such a model can be advantageous when adapting to new noise conditions, as only the noise part of the model needs to be modified. We show experimentally that our model significantly outperforms a previous factorial model based on a Gaussian mixture model (DOLPHIN), performs comparably to the integration of permutation invariant training with spatial clustering, and enables us to adapt easily to new noise conditions. The code for the method is available at https://github.com/BUTSpeechFIT/vae_dolphin.
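To make the factorial combination concrete, below is a minimal illustrative sketch, not the authors' implementation (see the repository above for that). It shows the core mechanic the abstract describes: per-source spectral log-likelihoods from a generative model and spatial log-likelihoods are added in the log domain, and time-frequency masks are obtained as normalized posteriors over the sources. The toy Gaussian stand-ins for the VAE decoder and the spatial model, and all shapes and names, are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, F, K = 50, 129, 2                      # frames, frequency bins, sources

logspec = rng.normal(size=(T, F))         # observed mixture log-spectrogram
spatial_feat = rng.normal(size=(T, F))    # placeholder spatial statistic

def spectral_loglik(k):
    """Per-TF log-likelihood under source k's spectral model.
    A unit-variance Gaussian around a hypothetical decoder output stands in
    for the VAE; in the method, a trained neural decoder plays this role."""
    mu_k = 0.5 * k * np.ones((T, F))      # hypothetical VAE decoder mean
    return -0.5 * (logspec - mu_k) ** 2   # shape (T, F)

def spatial_loglik(k):
    """Per-TF log-likelihood from a spatial model (e.g., one component of a
    spatial mixture over inter-channel features); a Gaussian placeholder."""
    mu_k = (k - 0.5) * np.ones((T, F))
    return -0.5 * (spatial_feat - mu_k) ** 2

# Factorial combination: spectral and spatial terms are summed in the log
# domain; per-source masks are the posteriors normalized over sources.
joint = np.stack([spectral_loglik(k) + spatial_loglik(k) for k in range(K)])
joint -= joint.max(axis=0, keepdims=True)                     # stability
masks = np.exp(joint) / np.exp(joint).sum(axis=0, keepdims=True)  # (K, T, F)
```

Because the spectral and spatial terms enter as separate factors, adapting to new noise conditions only requires replacing or re-estimating the noise source's model while the speech model stays fixed, which is the adaptation property highlighted in the abstract.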