Multi-source Domain Adaptation for Text-independent Forensic Speaker Recognition

Paper Title

Multi-source Domain Adaptation for Text-independent Forensic Speaker Recognition

Authors

Wang, Zhenyu; Hansen, John H. L.

Abstract

Adapting speaker recognition systems to new environments is a widely used technique for improving a well-performing model learned from large-scale data toward task-specific, small-scale data scenarios. However, previous studies focus on single-domain adaptation, which neglects a more practical scenario, needed in forensic work, where training data are collected from multiple acoustic domains. Audio analysis for forensic speaker recognition poses unique challenges in model training with multi-domain training data, owing to location/scenario uncertainty and diversity mismatch between reference and naturalistic field recordings. It is also difficult to directly employ small-scale domain-specific data to train complex neural network architectures, because of domain mismatch and performance loss. Fine-tuning is a commonly used adaptation method that retrains the model with weights initialized from a well-trained model. Alternatively, in this study, three novel adaptation methods based on domain adversarial training, discrepancy minimization, and moment-matching approaches are proposed to further promote adaptation performance across multiple acoustic domains. A comprehensive set of experiments is conducted to demonstrate that: 1) diverse acoustic environments do impact speaker recognition performance, which could advance research in audio forensics, 2) domain adversarial training learns discriminative features that are also invariant to shifts between domains, 3) discrepancy-minimizing adaptation achieves effective performance simultaneously across multiple acoustic domains, and 4) moment-matching adaptation along with dynamic distribution alignment also significantly promotes speaker recognition performance on each domain, especially for the noisy LENA-field domain, compared to all other systems.
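
The abstract names moment matching as one of the adaptation approaches but does not give its formulation. As a rough illustration only, the sketch below computes a generic multi-source moment-matching penalty: the distance between per-dimension raw moments of speaker embeddings from each source domain and the target domain, plus the same distance between pairs of source domains. The function names, the choice of raw moments up to order 2, and the equal weighting of the two terms are all assumptions made for this example, not the authors' method; the dynamic distribution alignment mentioned in the abstract is not modeled.

```python
from itertools import combinations
from math import sqrt


def raw_moment(feats, k):
    """Per-dimension k-th raw moment of a list of equal-length embedding vectors."""
    n = len(feats)
    return [sum(row[d] ** k for row in feats) / n for d in range(len(feats[0]))]


def l2(u, v):
    """Euclidean distance between two moment vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def moment_distance(sources, target, order=2):
    """Hypothetical multi-source moment-matching penalty (illustrative only).

    sources: list of source domains, each a list of embedding vectors
    target:  list of embedding vectors from the target domain
    order:   highest raw moment to compare (assumed to be 2 here)
    """
    total = 0.0
    for k in range(1, order + 1):
        src = [raw_moment(s, k) for s in sources]
        tgt = raw_moment(target, k)
        # Source-to-target term, averaged over the source domains.
        total += sum(l2(m, tgt) for m in src) / len(src)
        # Source-to-source term, averaged over all pairs of source domains.
        pairs = list(combinations(src, 2))
        if pairs:
            total += sum(l2(a, b) for a, b in pairs) / len(pairs)
    return total


# Toy 2-dimensional "embeddings" standing in for two acoustic domains.
domain_a = [[0.0, 1.0], [2.0, 3.0]]
domain_b = [[1.0, 2.0], [3.0, 4.0]]
penalty = moment_distance([domain_a, domain_b], domain_a)
```

In a training setup, a penalty of this kind would typically be added to the speaker classification loss so that the embedding network is pushed toward features whose statistics agree across domains; how the paper weights or schedules such a term is not stated in the abstract.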
