Extended U-Net for Speaker Verification in Noisy Environments

Paper title

Extended U-Net for Speaker Verification in Noisy Environments

Authors

Kim, Ju-ho; Heo, Jungwoo; Shim, Hye-jin; Yu, Ha-Jin

Abstract

Background noise is a well-known factor that deteriorates the accuracy and reliability of speaker verification (SV) systems by blurring speech intelligibility. Various studies have used separate pretrained enhancement models as the front-end module of the SV system in noisy environments, and these methods effectively remove noise. However, the denoising process of independent enhancement models not tailored to the SV task can also distort the speaker information included in utterances. We argue that the enhancement network and speaker embedding extractor should be fully jointly trained for SV tasks under noisy conditions to alleviate this issue. Therefore, we propose a U-Net-based integrated framework that simultaneously optimizes speaker identification and feature enhancement losses. Moreover, we analyze the structural limitations of using U-Net directly for noisy SV tasks and further propose Extended U-Net to reduce these drawbacks. We evaluated the models on the noise-synthesized VoxCeleb1 test set and the VOiCES development set recorded in various noisy scenarios. The experimental results demonstrate that the U-Net-based fully joint training framework is more effective than the baseline, and that Extended U-Net achieves state-of-the-art performance compared with recently proposed compensation systems.
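The abstract describes optimizing speaker identification and feature enhancement losses at the same time, but it does not state which loss functions or weighting the authors use. The sketch below is only an illustration of the general idea, assuming softmax cross-entropy for speaker identification, mean squared error between enhanced and clean features, and a hypothetical `enhancement_weight` parameter. All function names are invented for this example.

```python
import math


def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one utterance over speaker classes."""
    max_logit = max(logits)
    # Log-sum-exp, shifted by the maximum logit for numerical stability.
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target_index]


def mean_squared_error(enhanced, clean):
    """MSE between enhanced and clean feature values (flattened lists)."""
    return sum((e - c) ** 2 for e, c in zip(enhanced, clean)) / len(clean)


def joint_loss(logits, speaker_index, enhanced, clean, enhancement_weight=1.0):
    """Weighted sum of the two objectives.

    Assumption: a plain weighted sum; the paper's actual combination
    is not given in the abstract.
    """
    identification = cross_entropy(logits, speaker_index)
    enhancement = mean_squared_error(enhanced, clean)
    return identification + enhancement_weight * enhancement
```

In a jointly trained system, the gradient of such a combined loss would flow through both the enhancement network and the embedding extractor, so the denoising stage is pushed to preserve speaker information rather than only to match the clean features.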
