Paper title
X-TaSNet: Robust and Accurate Time-Domain Speaker Extraction Network
Paper authors
Paper abstract
Extracting the speech of a target speaker from mixed audio, given a reference utterance from that speaker, is a challenging yet powerful technology in speech processing. Recent studies of speaker-independent speech separation, such as TasNet, have shown promising results by applying deep neural networks to the time-domain waveform. Such separation networks do not directly generate reliable and accurate output when a target speaker is specified, because they require a prior on the number of speakers and lack robustness when dealing with audio with absent speakers. In this paper, we overcome these limitations by introducing a new speaker-aware speech masking method, called X-TaSNet. Our proposal adopts new strategies, including a distortion-based loss and a corresponding alternating training scheme, to better address the robustness issue. X-TaSNet significantly enhances the extracted speech quality, doubling the SDRi and SI-SNRi of the output speech audio over the state-of-the-art voice filtering approach. X-TaSNet also improves the reliability of the results by raising the accuracy of speaker identity in the output audio to 95.4%, such that it returns silent audio in most cases when the target speaker is absent. These results demonstrate that X-TaSNet moves one solid step towards more practical applications of speaker extraction technology.
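The abstract reports gains in SI-SNRi without defining the metric. As a rough illustration only, the sketch below computes the scale-invariant signal-to-noise ratio and its improvement over the unprocessed mixture, following the definition commonly used in the speech separation literature. It is an assumption that the paper uses exactly this formulation, and the function names `si_snr` and `si_snr_improvement` are hypothetical, not taken from the paper. This is not the paper's distortion-based training loss.

```python
import math


def _zero_mean(x):
    """Return a copy of x with its mean removed."""
    mean = sum(x) / len(x)
    return [v - mean for v in x]


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length sample sequences.

    Assumed definition: project the estimate onto the target, treat the
    residual as noise, and take the energy ratio in decibels.
    """
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    est = _zero_mean(estimate)
    ref = _zero_mean(target)
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    scale = dot / ref_energy
    projection = [scale * r for r in ref]              # part explained by the target
    noise = [e - p for e, p in zip(est, projection)]   # everything else
    proj_energy = sum(p * p for p in projection)
    noise_energy = sum(n * n for n in noise)
    return 10.0 * math.log10((proj_energy + eps) / (noise_energy + eps))


def si_snr_improvement(estimate, mixture, target):
    """SI-SNRi: SI-SNR of the estimate minus SI-SNR of the raw mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)


if __name__ == "__main__":
    # Synthetic example: a target tone plus an interfering tone.
    n = range(200)
    target = [math.sin(0.1 * k) for k in n]
    interferer = [math.sin(0.37 * k + 1.0) for k in n]
    mixture = [t + i for t, i in zip(target, interferer)]
    # A hypothetical extractor that suppresses the interferer by a factor of 10.
    estimate = [t + 0.1 * i for t, i in zip(target, interferer)]
    print(f"SI-SNRi: {si_snr_improvement(estimate, mixture, target):.2f} dB")
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the output waveform leaves the score unchanged, which is why this family of metrics is preferred when the extractor's output gain is arbitrary.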