EEND-SS: Flexible Number of Speakers

Paper title

EEND-SS: Joint End-to-End Neural Speaker Diarization and Speech Separation for Flexible Number of Speakers

Authors

Soumi Maiti, Yushi Ueda, Shinji Watanabe, Chunlei Zhang, Meng Yu, Shi-Xiong Zhang, Yong Xu

Abstract

In this paper, we present a novel framework that jointly performs three tasks: speaker diarization, speech separation, and speaker counting. Our proposed framework integrates speaker diarization based on end-to-end neural diarization (EEND) models, speaker counting with encoder-decoder based attractors (EDA), and speech separation using Conv-TasNet. In addition, we propose a multiple 1x1 convolutional layer architecture for estimating the separation masks corresponding to a flexible number of speakers and a fusion technique for refining the separated speech signal with obtained speaker diarization information to improve the joint framework. Experiments using the LibriMix dataset show that our proposed method outperforms the single-task baselines in both diarization and separation metrics for fixed and flexible numbers of speakers and improves speaker counting performance for flexible numbers of speakers. All materials will be open-sourced and reproducible in the ESPnet toolkit.
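The abstract mentions a multiple 1x1 convolutional layer architecture for estimating masks for a flexible number of speakers, without giving details. The sketch below shows one plausible reading, not the authors' implementation: it assumes one 1x1 convolution head per supported speaker count, chosen by the estimated count, and a sigmoid mask activation. The names `MultiHeadMaskEstimator`, `conv1x1`, and `estimate_masks` are hypothetical, the weights are random placeholders, and plain Python lists stand in for tensors so the example stays self-contained.

```python
import math
import random


def conv1x1(weight, features):
    """Apply a 1x1 convolution, i.e. a per-frame linear map over channels.

    weight:   out_channels x in_channels (list of lists)
    features: in_channels x frames (list of lists)
    returns:  out_channels x frames
    """
    frames = len(features[0])
    return [
        [sum(w * features[c][t] for c, w in enumerate(row)) for t in range(frames)]
        for row in weight
    ]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class MultiHeadMaskEstimator:
    """Hypothetical sketch: one 1x1 conv head per supported speaker count."""

    def __init__(self, in_channels, enc_dim, max_speakers, seed=0):
        rng = random.Random(seed)
        self.enc_dim = enc_dim
        # heads[k] maps in_channels -> k * enc_dim mask channels.
        # Random weights stand in for trained parameters.
        self.heads = {
            k: [
                [rng.uniform(-0.1, 0.1) for _ in range(in_channels)]
                for _ in range(k * enc_dim)
            ]
            for k in range(1, max_speakers + 1)
        }

    def estimate_masks(self, features, num_speakers):
        """Return num_speakers masks, each enc_dim x frames, with values in (0, 1)."""
        if num_speakers not in self.heads:
            raise ValueError(f"unsupported speaker count: {num_speakers}")
        out = conv1x1(self.heads[num_speakers], features)
        # Split the k * enc_dim output channels into k per-speaker masks.
        return [
            [
                [sigmoid(v) for v in row]
                for row in out[s * self.enc_dim:(s + 1) * self.enc_dim]
            ]
            for s in range(num_speakers)
        ]


# Toy separator features: 4 channels x 5 frames.
features = [[0.1 * (c + t) for t in range(5)] for c in range(4)]
estimator = MultiHeadMaskEstimator(in_channels=4, enc_dim=3, max_speakers=3)

# In a joint system the speaker count would come from the counting module;
# here it is simply set to 2.
masks = estimator.estimate_masks(features, num_speakers=2)
print(len(masks), len(masks[0]), len(masks[0][0]))
```

In this reading, each mask would be multiplied element-wise with the encoder output before decoding, as in a Conv-TasNet-style separator.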
