Paper Title
Deep Attention Fusion Feature for Speech Separation with End-to-End Post-filter Method
Paper Authors
Paper Abstract
In this paper, we propose an end-to-end post-filter method with deep attention fusion features for monaural speaker-independent speech separation. First, a time-frequency domain speech separation method is applied as the pre-separation stage, whose aim is to separate the mixture preliminarily. Although this stage can separate the mixture, the separated speech still contains residual interference. To enhance the pre-separated speech and further improve the separation performance, the end-to-end post-filter (E2EPF) with deep attention fusion features is proposed. The E2EPF can make full use of the prior knowledge of the pre-separated speech, which contributes to speech separation. It is a fully convolutional speech separation network that uses the waveform as the input feature. Firstly, 1-D convolutional layers are utilized to extract deep representation features of the mixture and the pre-separated signals in the time domain. Secondly, to pay more attention to the outputs of the pre-separation stage, an attention module is applied to acquire deep attention fusion features, which are extracted by computing the similarity between the mixture and the pre-separated speech. These deep attention fusion features are conducive to reducing the interference and enhancing the pre-separated speech. Finally, these features are fed to the post-filter to estimate each target signal. Experimental results on the WSJ0-2mix dataset show that the proposed method outperforms state-of-the-art speech separation methods. Compared with the pre-separation method, our proposed method achieves relative improvements of 64.1%, 60.2%, 25.6%, and 7.5% in scale-invariant source-to-noise ratio (SI-SNR), signal-to-distortion ratio (SDR), perceptual evaluation of speech quality (PESQ), and short-time objective intelligibility (STOI), respectively.
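
For concreteness, below is a minimal PyTorch sketch of the deep attention fusion step the abstract describes: 1-D convolutional encoders produce deep representation features for the mixture and a pre-separated waveform, and an attention module weights the pre-separated features by their similarity to the mixture features before fusion. The class name, the hyper-parameters, and the scaled dot-product form of the similarity are illustrative assumptions, not the paper's exact configuration.

    # Minimal sketch of the deep attention fusion feature (assumptions noted above).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AttentionFusion(nn.Module):
        def __init__(self, n_filters=256, kernel_size=16, stride=8):
            super().__init__()
            # 1-D conv encoders map raw waveforms to deep representation features.
            self.mix_encoder = nn.Conv1d(1, n_filters, kernel_size, stride=stride)
            self.pre_encoder = nn.Conv1d(1, n_filters, kernel_size, stride=stride)

        def forward(self, mixture, pre_separated):
            # mixture, pre_separated: (batch, 1, samples) raw waveforms.
            mix_feat = F.relu(self.mix_encoder(mixture))        # (B, C, T)
            pre_feat = F.relu(self.pre_encoder(pre_separated))  # (B, C, T)

            # Similarity between mixture and pre-separated features over time,
            # realized here as scaled dot-product attention scores.
            scores = torch.bmm(mix_feat.transpose(1, 2), pre_feat)  # (B, T, T)
            weights = torch.softmax(scores / mix_feat.size(1) ** 0.5, dim=-1)

            # Attend to the pre-separated features and concatenate with the
            # mixture features; the fused tensor plays the role of the
            # "deep attention fusion feature" fed to the post-filter.
            attended = torch.bmm(pre_feat, weights.transpose(1, 2))  # (B, C, T)
            return torch.cat([mix_feat, attended], dim=1)            # (B, 2C, T)

    # Usage: fuse a 1-second, 16 kHz mixture with a pre-separated estimate.
    fusion = AttentionFusion()
    mix = torch.randn(2, 1, 16000)
    pre = torch.randn(2, 1, 16000)
    feats = fusion(mix, pre)  # (2, 512, T)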
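
The abstract's primary metric, SI-SNR, follows the standard definition: the estimate is projected onto the target to isolate the scaled target component, and the ratio of its energy to the residual noise energy is expressed in dB. A reference implementation (not the authors' code) might look like:

    # SI-SNR = 10 log10(||s_target||^2 / ||e_noise||^2), with
    # s_target = (<x_hat, s> / ||s||^2) * s and e_noise = x_hat - s_target.
    import torch

    def si_snr(estimate, target, eps=1e-8):
        # Remove the DC offset so the measure is offset-invariant.
        estimate = estimate - estimate.mean(dim=-1, keepdim=True)
        target = target - target.mean(dim=-1, keepdim=True)
        # Project the estimate onto the target: scaled target component.
        dot = (estimate * target).sum(dim=-1, keepdim=True)
        s_target = dot * target / (target.pow(2).sum(dim=-1, keepdim=True) + eps)
        e_noise = estimate - s_target
        ratio = s_target.pow(2).sum(dim=-1) / (e_noise.pow(2).sum(dim=-1) + eps)
        return 10 * torch.log10(ratio + eps)  # dB, higher is better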