Paper Title
Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation
Paper Authors
Abstract
Target speech separation refers to extracting the target speaker's speech from mixed signals. Despite recent advances in deep-learning-based close-talk speech separation, real-world application remains an open issue. Two main challenges are complex acoustic environments and the real-time processing requirement. To address these challenges, we propose a temporal-spatial neural filter, which directly estimates the target speech waveform from a multi-speaker mixture in reverberant environments, assisted by directional information of the speaker(s). First, to handle the variations introduced by complex environments, the key idea is to increase the completeness of the acoustic representation by jointly modeling the temporal, spectral, and spatial discriminability between the target and interference sources. Specifically, temporal, spectral, and spatial features, along with the designed directional features, are integrated to create a joint acoustic representation. Second, to reduce latency, we design a fully-convolutional autoencoder framework that is purely end-to-end and single-pass. All feature computation is implemented by network layers and operations to speed up the separation procedure. Evaluation is conducted on the simulated reverberant WSJ0-2mix and WSJ0-3mix datasets under a speaker-independent scenario. Experimental results demonstrate that the proposed method outperforms state-of-the-art deep-learning-based multi-channel approaches with fewer parameters and faster processing speed. Furthermore, the proposed temporal-spatial neural filter can handle mixtures with varying and unknown numbers of speakers, and exhibits consistent performance even in the presence of direction estimation errors. Code and models will be released soon.
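As a rough illustration of the spatial cues the abstract refers to, the sketch below computes inter-channel phase differences (IPD), a spatial feature commonly used in multi-channel separation front-ends. This is a minimal NumPy sketch under assumed settings (two channels, a naive Hann-windowed STFT, illustrative function names), not the paper's actual network-layer implementation.

```python
import numpy as np

def stft(x, frame_len=256, hop=128):
    """Naive STFT: frame the signal, apply a Hann window, take the real FFT."""
    n_frames = 1 + (len(x) - frame_len) // hop
    win = np.hanning(frame_len)
    frames = np.stack([x[i * hop:i * hop + frame_len] * win
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)           # shape (n_frames, n_bins)

def ipd_features(mix):
    """Cos/sin inter-channel phase differences against reference channel 0.

    mix: (n_channels, n_samples) multi-channel waveform.
    Returns an array of shape (n_pairs, n_frames, n_bins, 2).
    """
    specs = np.stack([stft(ch) for ch in mix])    # (C, T, F) complex spectra
    ref_phase = np.angle(specs[0])
    feats = []
    for c in range(1, specs.shape[0]):
        ipd = np.angle(specs[c]) - ref_phase
        # cos/sin encoding avoids the 2*pi phase-wrapping discontinuity
        feats.append(np.stack([np.cos(ipd), np.sin(ipd)], axis=-1))
    return np.stack(feats)

# Toy 2-channel mixture: channel 1 is a delayed copy of channel 0, so the
# IPD is (approximately) a linear function of frequency, as for a far-field
# source arriving from a fixed direction.
rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
mix = np.stack([x, np.roll(x, 4)])
feats = ipd_features(mix)
print(feats.shape)  # (1, 31, 129, 2)
```

In the paper's framework such feature computations are expressed as network layers and operations so the whole pipeline stays end-to-end; the NumPy version here only shows what the feature itself encodes.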