Paper Title

TF-GridNet: Making Time-Frequency Domain Models Great Again for Monaural Speaker Separation

Paper Authors

Zhong-Qiu Wang, Samuele Cornell, Shukjae Choi, Younglo Lee, Byeong-Yeol Kim, Shinji Watanabe

Paper Abstract

We propose TF-GridNet, a novel multi-path deep neural network (DNN) operating in the time-frequency (T-F) domain, for monaural talker-independent speaker separation in anechoic conditions. The model stacks several multi-path blocks, each consisting of an intra-frame spectral module, a sub-band temporal module, and a full-band self-attention module, to leverage local and global spectro-temporal information for separation. The model is trained to perform complex spectral mapping, where the real and imaginary (RI) components of the input mixture are stacked as input features to predict target RI components. Besides using the scale-invariant signal-to-distortion ratio (SI-SDR) loss for model training, we include a novel loss term to encourage separated sources to add up to the input mixture. Without using dynamic mixing, we obtain 23.4 dB SI-SDR improvement (SI-SDRi) on the WSJ0-2mix dataset, outperforming the previous best by a large margin.
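As a rough illustration of the training objective described in the abstract, below is a minimal NumPy sketch that combines a per-source negative SI-SDR loss with a simple penalty encouraging the separated estimates to sum to the input mixture. The function names, the L1 form of the mixture term, and the `weight` hyperparameter are assumptions made for illustration only; the paper's exact loss formulation, and its pairing of estimates with references via permutation-invariant training, are not reproduced here.

```python
import numpy as np

def si_sdr_loss(est, ref, eps=1e-8):
    """Negative scale-invariant SDR (in dB) between an estimate and a reference."""
    # alpha minimises ||est - alpha * ref||^2, making the loss invariant to scaling.
    alpha = np.dot(est, ref) / (np.dot(ref, ref) + eps)
    target = alpha * ref
    noise = est - target
    si_sdr = 10.0 * np.log10((np.sum(target ** 2) + eps) / (np.sum(noise ** 2) + eps))
    return -si_sdr  # negated so that lower values are better during training

def mixture_sum_penalty(ests, mixture):
    """Hypothetical L1 penalty pushing the separated sources to add up to the mixture."""
    return float(np.mean(np.abs(ests.sum(axis=0) - mixture)))

def separation_loss(ests, refs, mixture, weight=1.0):
    """Negative SI-SDR averaged over sources, plus the mixture-consistency term."""
    sep = np.mean([si_sdr_loss(e, r) for e, r in zip(ests, refs)])
    return sep + weight * mixture_sum_penalty(np.asarray(ests), mixture)

# Toy usage: two 1-second sources at 8 kHz, their mixture, and imperfect estimates.
rng = np.random.default_rng(0)
refs = rng.standard_normal((2, 8000))
mixture = refs.sum(axis=0)
ests = refs + 0.1 * rng.standard_normal(refs.shape)
print(separation_loss(ests, refs, mixture))
```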
