Segment Aggregation for Short Utterances Speaker Verification Using Raw Waveforms

Paper title

Segment Aggregation for short utterances speaker verification using raw waveforms

Authors

Seung-bin Kim, Jee-weon Jung, Hye-jin Shim, Ju-ho Kim, Ha-Jin Yu

Abstract

Most studies on speaker verification systems focus on long-duration utterances, which contain sufficient phonetic information. However, the performance of these systems is known to degrade on short-duration utterances, which carry less phonetic information than long ones. In this paper, we propose a method, referred to as "segment aggregation", that compensates for the performance degradation of speaker verification on short utterances. The proposed method adopts an ensemble-based design to improve the stability and accuracy of speaker verification systems. It segments an input utterance into several short utterances and then aggregates the segment embeddings extracted from the segmented inputs to compose a speaker embedding. The segment embeddings and the aggregated speaker embedding are trained simultaneously. In addition, we modify the teacher-student learning method to fit the proposed method. Experimental results for different input durations on the VoxCeleb1 test set demonstrate that the proposed technique achieves a relative improvement of about 45.37% in speaker verification performance over the baseline system under the 1-second test utterance condition.
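
The segment-then-aggregate step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy two-dimensional embedding, and the use of a plain element-wise mean as the aggregation are all assumptions made for illustration. In the paper, segment embeddings come from a neural network operating on raw waveforms, and the abstract does not specify the aggregation operator.

```python
from typing import Callable, List, Sequence, Tuple

Embedding = List[float]


def split_into_segments(waveform: Sequence[float], segment_len: int, hop: int) -> List[List[float]]:
    """Cut a raw waveform into fixed-length segments (hypothetical helper).

    Inputs shorter than one segment are zero-padded so that at least one
    segment is always returned; this padding rule is an assumption.
    """
    if segment_len <= 0 or hop <= 0:
        raise ValueError("segment_len and hop must be positive")
    samples = list(waveform)
    if len(samples) < segment_len:
        samples = samples + [0.0] * (segment_len - len(samples))
    last_start = len(samples) - segment_len
    return [samples[s:s + segment_len] for s in range(0, last_start + 1, hop)]


def toy_segment_embedding(segment: Sequence[float]) -> Embedding:
    """Stand-in for a learned segment encoder: returns [mean, mean energy]."""
    n = len(segment)
    mean = sum(segment) / n
    energy = sum(x * x for x in segment) / n
    return [mean, energy]


def aggregate(embeddings: Sequence[Embedding]) -> Embedding:
    """Combine segment embeddings by element-wise mean (assumed aggregation)."""
    count = len(embeddings)
    dim = len(embeddings[0])
    return [sum(e[d] for e in embeddings) / count for d in range(dim)]


def speaker_embedding(
    waveform: Sequence[float],
    segment_len: int,
    hop: int,
    embed: Callable[[Sequence[float]], Embedding] = toy_segment_embedding,
) -> Tuple[List[Embedding], Embedding]:
    """Return both the per-segment embeddings and the aggregated embedding."""
    segments = split_into_segments(waveform, segment_len, hop)
    segment_embeddings = [embed(seg) for seg in segments]
    return segment_embeddings, aggregate(segment_embeddings)


# Usage: an 8-sample "utterance" split into two non-overlapping 4-sample segments.
segment_embs, speaker_emb = speaker_embedding(
    [0.0, 0.0, 0.0, 0.0, 2.0, 2.0, 2.0, 2.0], segment_len=4, hop=4
)
```

Returning both the segment embeddings and the aggregated embedding mirrors the abstract's statement that the two are trained simultaneously: a training loop could attach a loss to each segment embedding as well as to the aggregate. How those losses are defined and weighted is not stated in the abstract and is left out of this sketch.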
