In search of strong embedding extractors for speaker diarisation

Paper title

In search of strong embedding extractors for speaker diarisation

Authors

Jung, Jee-weon; Heo, Hee-Soo; Lee, Bong-Jin; Huh, Jaesung; Brown, Andrew; Kwon, Youngki; Watanabe, Shinji; Chung, Joon Son

Abstract

Speaker embedding extractors (EEs), which map input audio to a speaker discriminant latent space, are of paramount importance in speaker diarisation. However, there are several challenges when adopting EEs for diarisation, from which we tackle two key problems. First, the evaluation is not straightforward because the features required for better performance differ between speaker verification and diarisation. We show that better performance on widely adopted speaker verification evaluation protocols does not lead to better diarisation performance. Second, embedding extractors have not seen utterances in which multiple speakers exist. These inputs are inevitably present in speaker diarisation because of overlapped speech and speaker changes; they degrade the performance. To mitigate the first problem, we generate speaker verification evaluation protocols that mimic the diarisation scenario better. We propose two data augmentation techniques to alleviate the second problem, making embedding extractors aware of overlapped speech or speaker change input. One technique generates overlapped speech segments, and the other generates segments where two speakers utter sequentially. Extensive experimental results using three state-of-the-art speaker embedding extractors demonstrate that both proposed approaches are effective.
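
The two augmentations named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `ratio_db` energy ratio, and the `change_at` sample index are hypothetical choices made for illustration, and the abstract does not state how the paper sets such parameters. Both functions assume mono waveforms given as 1-D float arrays at the same sample rate.

```python
import numpy as np


def overlap_mix(primary, secondary, ratio_db=5.0):
    """Overlapped speech: add a second speaker on top of the primary one.

    ratio_db is an assumed primary-to-secondary power ratio in dB.
    """
    n = min(len(primary), len(secondary))
    primary, secondary = primary[:n], secondary[:n]
    p_pow = np.mean(primary ** 2) + 1e-12
    s_pow = np.mean(secondary ** 2) + 1e-12
    # Scale the secondary speaker so that p_pow / scaled s_pow matches ratio_db.
    scale = np.sqrt(p_pow / (s_pow * 10 ** (ratio_db / 10)))
    return primary + scale * secondary


def speaker_change(first, second, change_at):
    """Speaker change: `first` speaks up to sample change_at, then `second`."""
    n = min(len(first), len(second))
    if not 0 < change_at < n:
        raise ValueError("change_at must fall strictly inside the segment")
    return np.concatenate([first[:change_at], second[change_at:n]])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    spk_a = rng.standard_normal(16000)  # stand-in for 1 s of speaker A
    spk_b = rng.standard_normal(16000)  # stand-in for 1 s of speaker B
    print(overlap_mix(spk_a, spk_b).shape)
    print(speaker_change(spk_a, spk_b, 8000).shape)
```

In a training pipeline, segments like these would be fed to the embedding extractor alongside single-speaker segments; how they are labelled is a design decision that the abstract does not describe.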
