Paper Title

Simultaneous Speech Extraction for Multiple Target Speakers under the Meeting Scenarios

Authors

Bang Zeng, Hongbing Suo, Yulong Wan, Ming Li

Abstract

Common target speech separation methods directly estimate the target source and ignore the interrelationship between different speakers at each frame. We propose a multiple-target speech separation (MTSS) model that simultaneously extracts each speaker's voice from the mixed speech, rather than only optimally estimating a single target source. Moreover, we propose a speaker diarization (SD) aware MTSS system (SD-MTSS), which consists of an SD module and an MTSS module. By exploiting the TSVAD decision and the estimated mask, the SD-MTSS model can extract the speech signal of each speaker concurrently in a conversational recording, without requiring additional enrollment audio in advance. Experimental results show that the MTSS model achieves improvements of 1.38 dB in SDR, 1.34 dB in SI-SDR, and 0.13 in PESQ over the baseline on the WSJ0-2mix-extr dataset. The SD-MTSS system achieves a 19.2% relative reduction in speaker-dependent character error rate (CER) on the Alimeeting dataset.
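
For readers unfamiliar with the SI-SDR metric reported in the abstract, the following is a minimal sketch of its commonly used definition: the estimate is projected onto the reference, and the energy of that projection is compared with the energy of the residual. This is not the authors' evaluation code. The function name `si_sdr` and the toy signals are hypothetical, and the mean removal that many toolkits apply before the projection is omitted for brevity.

```python
import math


def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB for two equal-length sample sequences.

    Illustrative sketch only; assumes a non-silent reference and does
    not subtract the signal means first.
    """
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")

    # Optimal scaling of the reference: <estimate, reference> / ||reference||^2
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    if ref_energy == 0.0:
        raise ValueError("reference must not be silent")
    scale = dot / ref_energy

    # Split the estimate into a scaled-target part and a residual part.
    target = [scale * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]

    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    if residual_energy == 0.0:
        return math.inf  # estimate is an exact scaled copy of the reference
    return 10.0 * math.log10(target_energy / residual_energy)


# Toy example: the distortion is orthogonal to the reference and has
# one hundredth of its energy.
reference = [1.0, -1.0, 1.0, -1.0]
distortion = [1.0, 1.0, -1.0, -1.0]
estimate = [r + 0.1 * d for r, d in zip(reference, distortion)]
score = si_sdr(estimate, reference)
rescaled_score = si_sdr([2.0 * e for e in estimate], reference)
```

Because the reference is rescaled to best match the estimate before the comparison, multiplying the estimate by a constant leaves the score unchanged, which is what distinguishes SI-SDR from plain SDR.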
