Simultaneous Speech Extraction for Multiple Target Speakers under the Meeting Scenarios

Paper title

Simultaneous Speech Extraction for Multiple Target Speakers under the Meeting Scenarios

Authors

Bang Zeng, Hongbing Suo, Yulong Wan, Ming Li

Abstract

Common target speech separation methods directly estimate the target source and ignore the interrelationship between different speakers at each frame. We propose a multiple-target speech separation (MTSS) model that simultaneously extracts each speaker's voice from the mixed speech, rather than only optimally estimating a single target source. We further propose a speaker diarization (SD) aware MTSS system (SD-MTSS), which consists of an SD module and an MTSS module. By exploiting the TSVAD decision and the estimated mask, the SD-MTSS model can extract the speech signal of each speaker concurrently from a conversational recording, without requiring additional enrollment audio in advance. Experimental results show that the MTSS model achieves improvements of 1.38 dB SDR, 1.34 dB SI-SDR, and 0.13 PESQ over the baseline on the WSJ0-2mix-extr dataset. The SD-MTSS system achieves a 19.2% relative reduction in speaker-dependent character error rate (CER) on the Alimeeting dataset.
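
The abstract reports gains in SI-SDR. As a reminder of what that metric measures, the following is a minimal sketch of the standard scale-invariant SDR definition in plain Python. It is not the authors' evaluation code; the function name si_sdr and the small eps stabilizer are assumptions made for illustration.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant signal-to-distortion ratio in dB.

    Illustrative helper (hypothetical name), not taken from the paper.
    Both inputs are equal-length sequences of float samples.
    """
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")
    n = len(reference)

    # Remove the mean so a DC offset does not affect the score.
    mean_e = sum(estimate) / n
    mean_r = sum(reference) / n
    e = [x - mean_e for x in estimate]
    r = [x - mean_r for x in reference]

    # Project the estimate onto the reference to get the scaled target.
    ref_energy = sum(x * x for x in r)
    alpha = sum(a * b for a, b in zip(e, r)) / (ref_energy + eps)
    target = [alpha * x for x in r]

    # Everything that is not explained by the scaled reference is distortion.
    distortion = [a - b for a, b in zip(e, target)]

    target_energy = sum(x * x for x in target)
    distortion_energy = sum(x * x for x in distortion)
    return 10.0 * math.log10(target_energy / (distortion_energy + eps))


if __name__ == "__main__":
    ref = [math.sin(0.1 * i) for i in range(200)]
    interference = [math.cos(0.37 * i) for i in range(200)]
    baseline = [s + 0.1 * v for s, v in zip(ref, interference)]
    improved = [s + 0.01 * v for s, v in zip(ref, interference)]
    gain = si_sdr(improved, ref) - si_sdr(baseline, ref)
    print(f"SI-SDR improvement: {gain:.2f} dB")
```

An "SI-SDR improvement" such as the one quoted in the abstract is the difference between two such scores, here one for an improved estimate and one for a baseline estimate of the same reference. Because the estimate is projected onto the reference first, rescaling the estimate leaves the score unchanged.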
