NBC2: Multichannel Speech Separation with Revised Narrow-band Conformer

Paper title

NBC2: Multichannel Speech Separation with Revised Narrow-band Conformer

Authors

Changsheng Quan, Xiaofei Li

Abstract

This work proposes a multichannel narrow-band speech separation network. In the short-time Fourier transform (STFT) domain, the proposed network processes each frequency independently, and all frequencies use a shared network. For each frequency, the network performs end-to-end speech separation, namely taking as input the STFT coefficients of microphone signals, and predicting the separated STFT coefficients of multiple speakers. The proposed network learns to cluster the frame-wise spatial/steering vectors that belong to different speakers. It is mainly composed of three components. First, a self-attention network. Clustering of spatial vectors shares a similar principle with the self-attention mechanism in the sense of computing the similarity of vectors and then aggregating similar vectors. Second, a convolutional feed-forward network. The convolutional layers are employed for signal smoothing and reverberation processing. Third, a novel hidden-layer normalization method, i.e. group batch normalization (GBN), is especially designed for the proposed narrow-band network to maintain the distribution of hidden units over frequencies. Overall, the proposed network is named NBC2, as it is a revised version of our previous NBC (narrow-band conformer) network. Experiments show that 1) the proposed network outperforms other state-of-the-art methods by a large margin, 2) the proposed GBN improves the signal-to-distortion ratio by 3 dB, relative to other normalization methods, such as batch/layer/group normalization, 3) the proposed narrow-band network is spectrum-agnostic, as it does not learn spectral patterns, and 4) the proposed network is indeed performing frame clustering (demonstrated by the attention maps).
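The narrow-band idea in the abstract (each frequency handled independently by one shared network, with frames grouped by vector similarity) can be illustrated with a minimal NumPy sketch. It is not the NBC2 architecture: the function names, the real/imaginary input layout, the projection-free single-head attention, and the output weights `w_out` are all assumptions made for illustration, and the convolutional feed-forward layers and GBN are omitted.

```python
import numpy as np

def to_narrow_band(stft):
    """Turn a complex STFT of shape (freq, time, mics) into real-valued
    per-frequency sequences of shape (freq, time, 2 * mics).
    Assumption: real and imaginary parts are concatenated as features."""
    return np.concatenate([stft.real, stft.imag], axis=-1)

def frame_self_attention(x):
    """Dot-product self-attention over frames, computed separately for each
    frequency. Simplification: no learned query/key/value projections.
    x has shape (freq, time, dim)."""
    dim = x.shape[-1]
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(dim)  # (freq, time, time)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over frames
    return weights @ x, weights                       # aggregate similar frames

def separate_narrow_band(stft, w_out, n_speakers):
    """Apply the same hypothetical weights to every frequency and return
    complex per-speaker coefficients of shape (freq, time, n_speakers)."""
    x = to_narrow_band(stft)                # (freq, time, 2 * mics)
    h, _ = frame_self_attention(x)          # (freq, time, 2 * mics)
    y = h @ w_out                           # (freq, time, 2 * n_speakers)
    real, imag = y[..., :n_speakers], y[..., n_speakers:]
    return real + 1j * imag

# Usage: 5 frequencies, 7 frames, 4 microphones, 2 speakers (random data).
rng = np.random.default_rng(0)
stft = rng.standard_normal((5, 7, 4)) + 1j * rng.standard_normal((5, 7, 4))
w_out = rng.standard_normal((8, 4))         # shared across all frequencies
separated = separate_narrow_band(stft, w_out, n_speakers=2)
```

Because the frequency axis is treated as a batch axis, perturbing one frequency of the input leaves the outputs at all other frequencies unchanged, which is the sense in which such a model cannot learn cross-frequency spectral patterns.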
