Paper Title
Deep Attention Fusion Feature for Speech Separation with End-to-End Post-filter Method
Paper Authors
Paper Abstract
In this paper, we propose an end-to-end post-filter method with deep attention fusion features for monaural speaker-independent speech separation. First, a time-frequency domain speech separation method is applied as the pre-separation stage, whose aim is to separate the mixture preliminarily. Although this stage can separate the mixture, the separated speech still contains residual interference. To enhance the pre-separated speech and further improve the separation performance, the end-to-end post-filter (E2EPF) with deep attention fusion features is proposed. The E2EPF can make full use of the prior knowledge of the pre-separated speech, which contributes to speech separation. It is a fully convolutional speech separation network that uses the waveform as the input feature. Firstly, 1-D convolutional layers are utilized to extract deep representation features of the mixture and the pre-separated signals in the time domain. Secondly, to pay more attention to the outputs of the pre-separation stage, an attention module is applied to acquire deep attention fusion features, which are extracted by computing the similarity between the mixture and the pre-separated speech. These deep attention fusion features are conducive to reducing the interference and enhancing the pre-separated speech. Finally, these features are fed to the post-filter to estimate each target signal. Experimental results on the WSJ0-2mix dataset show that the proposed method outperforms state-of-the-art speech separation methods. Compared with the pre-separation method, our proposed method achieves relative improvements of 64.1%, 60.2%, 25.6%, and 7.5% in scale-invariant source-to-noise ratio (SI-SNR), signal-to-distortion ratio (SDR), perceptual evaluation of speech quality (PESQ), and short-time objective intelligibility (STOI), respectively.
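
For concreteness, below is a minimal PyTorch sketch of the deep attention fusion step the abstract describes: 1-D convolutional encoders produce deep representation features for the mixture and a pre-separated waveform, and an attention module weights the pre-separated features by their similarity to the mixture features before fusion. The class name, the hyper-parameters, and the scaled dot-product form of the similarity are illustrative assumptions, not the paper's exact configuration.

    # Minimal sketch of the deep attention fusion feature (assumptions noted above).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AttentionFusion(nn.Module):
        def __init__(self, n_filters=256, kernel_size=16, stride=8):
            super().__init__()
            # 1-D conv encoders map raw waveforms to deep representation features.
            self.mix_encoder = nn.Conv1d(1, n_filters, kernel_size, stride=stride)
            self.pre_encoder = nn.Conv1d(1, n_filters, kernel_size, stride=stride)

        def forward(self, mixture, pre_separated):
            # mixture, pre_separated: (batch, 1, samples) raw waveforms.
            mix_feat = F.relu(self.mix_encoder(mixture))        # (B, C, T)
            pre_feat = F.relu(self.pre_encoder(pre_separated))  # (B, C, T)

            # Similarity between mixture and pre-separated features over time,
            # realized here as scaled dot-product attention scores.
            scores = torch.bmm(mix_feat.transpose(1, 2), pre_feat)  # (B, T, T)
            weights = torch.softmax(scores / mix_feat.size(1) ** 0.5, dim=-1)

            # Attend to the pre-separated features and concatenate with the
            # mixture features; the fused tensor plays the role of the
            # "deep attention fusion feature" fed to the post-filter.
            attended = torch.bmm(pre_feat, weights.transpose(1, 2))  # (B, C, T)
            return torch.cat([mix_feat, attended], dim=1)            # (B, 2C, T)

    # Usage: fuse a 1-second, 16 kHz mixture with a pre-separated estimate.
    fusion = AttentionFusion()
    mix = torch.randn(2, 1, 16000)
    pre = torch.randn(2, 1, 16000)
    feats = fusion(mix, pre)  # (2, 512, T)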
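
The abstract's primary metric, SI-SNR, follows the standard definition: the estimate is projected onto the target to isolate the scaled target component, and the ratio of its energy to the residual noise energy is expressed in dB. A reference implementation (not the authors' code) might look like:

    # SI-SNR = 10 log10(||s_target||^2 / ||e_noise||^2), with
    # s_target = (<x_hat, s> / ||s||^2) * s and e_noise = x_hat - s_target.
    import torch

    def si_snr(estimate, target, eps=1e-8):
        # Remove the DC offset so the measure is offset-invariant.
        estimate = estimate - estimate.mean(dim=-1, keepdim=True)
        target = target - target.mean(dim=-1, keepdim=True)
        # Project the estimate onto the target: scaled target component.
        dot = (estimate * target).sum(dim=-1, keepdim=True)
        s_target = dot * target / (target.pow(2).sum(dim=-1, keepdim=True) + eps)
        e_noise = estimate - s_target
        ratio = s_target.pow(2).sum(dim=-1) / (e_noise.pow(2).sum(dim=-1) + eps)
        return 10 * torch.log10(ratio + eps)  # dB, higher is better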