Cosine-Distance Virtual Adversarial Training for Semi-Supervised Speaker-Discriminative Acoustic Embeddings

Paper title

Cosine-Distance Virtual Adversarial Training for Semi-Supervised Speaker-Discriminative Acoustic Embeddings

Authors

Florian L. Kreyssig, Philip C. Woodland

Abstract

In this paper, we propose a semi-supervised learning (SSL) technique for training deep neural networks (DNNs) to generate speaker-discriminative acoustic embeddings (speaker embeddings). Obtaining large amounts of speaker recognition training data can be difficult for desired target domains, especially under privacy constraints. The proposed technique reduces requirements for labelled data by leveraging unlabelled data. The technique is a variant of virtual adversarial training (VAT) [1] in the form of a loss that is defined as the robustness of the speaker embedding against input perturbations, as measured by the cosine distance. Thus, we term the technique cosine-distance virtual adversarial training (CD-VAT). In comparison to many existing SSL techniques, the unlabelled data does not have to come from the same set of classes (here speakers) as the labelled data. The effectiveness of CD-VAT is shown on the 2750+ hour VoxCeleb data set, where on a speaker verification task it achieves a reduction in equal error rate (EER) of 11.1% relative to a purely supervised baseline. This is 32.5% of the improvement that would be achieved from supervised training if the speaker labels for the unlabelled data were available.
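
The loss described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. The embedding network is replaced by a hypothetical toy function `embed` (a single `tanh` layer), the gradient is taken by finite differences rather than backpropagation, and the names `cd_vat_loss`, `eps`, `xi` and `n_power` are assumptions made for illustration. The sketch follows the usual VAT recipe: start from a random direction, refine it with a power-iteration step towards the direction that changes the embedding most, then report the cosine distance between the clean and perturbed embeddings.

```python
import numpy as np


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two vectors."""
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)


def embed(x, W):
    """Hypothetical toy stand-in for the speaker-embedding DNN."""
    return np.tanh(W @ x)


def numerical_grad(f, r, h=1e-5):
    """Central finite-difference gradient of the scalar function f at r."""
    g = np.zeros_like(r)
    for i in range(r.size):
        step = np.zeros_like(r)
        step[i] = h
        g[i] = (f(r + step) - f(r - step)) / (2 * h)
    return g


def cd_vat_loss(x, W, eps=0.5, xi=0.1, n_power=1, rng=None):
    """Cosine distance between clean and adversarially perturbed embeddings.

    eps is the norm of the final perturbation, xi the small probe scale
    used during power iteration (both are assumed hyperparameters).
    No speaker label is needed, so x may be an unlabelled example.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    clean = embed(x, W)  # treated as a fixed target

    def dist(r):
        return cosine_distance(clean, embed(x + r, W))

    # Random unit direction to start the power iteration.
    d = rng.standard_normal(x.shape)
    d /= np.linalg.norm(d)
    for _ in range(n_power):
        # The gradient is zero at r = 0, so probe at the small offset xi * d.
        g = numerical_grad(dist, xi * d)
        d = g / (np.linalg.norm(g) + 1e-12)

    r_adv = eps * d
    return dist(r_adv), r_adv


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    W = 0.25 * rng.standard_normal((8, 16))  # toy embedding weights
    x = rng.standard_normal(16)              # toy unlabelled input
    loss, r_adv = cd_vat_loss(x, W)
    print("CD-VAT loss:", loss, "perturbation norm:", np.linalg.norm(r_adv))
```

In a real training setup the gradient would come from automatic differentiation, and a loss of this kind would typically be added, with a weighting factor, to the supervised loss computed on the labelled data.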
