Paper Title

Cosine-Distance Virtual Adversarial Training for Semi-Supervised Speaker-Discriminative Acoustic Embeddings

Authors

Kreyssig, Florian L., Woodland, Philip C.

Abstract

In this paper, we propose a semi-supervised learning (SSL) technique for training deep neural networks (DNNs) to generate speaker-discriminative acoustic embeddings (speaker embeddings). Obtaining large amounts of speaker recognition training data can be difficult for desired target domains, especially under privacy constraints. The proposed technique reduces requirements for labelled data by leveraging unlabelled data. The technique is a variant of virtual adversarial training (VAT) [1] in the form of a loss that is defined as the robustness of the speaker embedding against input perturbations, as measured by the cosine-distance. Thus, we term the technique cosine-distance virtual adversarial training (CD-VAT). In comparison to many existing SSL techniques, the unlabelled data does not have to come from the same set of classes (here speakers) as the labelled data. The effectiveness of CD-VAT is shown on the 2750+ hour VoxCeleb data set, where on a speaker verification task it achieves a reduction in equal error rate (EER) of 11.1% relative to a purely supervised baseline. This is 32.5% of the improvement that would be achieved from supervised training if the speaker labels for the unlabelled data were available.
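
The loss described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the toy `embed` function (a single tanh layer) stands in for the speaker-embedding DNN, the distance is assumed to be one minus the cosine similarity, the adversarial direction is found with the power-iteration step used in standard VAT, and the gradient is approximated by finite differences so that the example needs only the Python standard library. The names `epsilon`, `probe_scale`, and `power_iters` are hypothetical hyperparameter names chosen for this illustration.

```python
import math
import random


def embed(weights, x):
    """Toy stand-in for the speaker-embedding DNN: tanh(W x)."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]


def cosine_distance(a, b):
    """One minus cosine similarity (assumed form of the distance)."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return 1.0 - dot / (norm_a * norm_b)


def unit(v):
    """Scale v to unit L2 norm; a zero vector is returned unchanged."""
    n = math.sqrt(sum(c * c for c in v))
    return list(v) if n == 0.0 else [c / n for c in v]


def perturbed_distance(weights, x, r):
    """Cosine distance between the embeddings of x and x + r."""
    clean = embed(weights, x)
    noisy = embed(weights, [v + dv for v, dv in zip(x, r)])
    return cosine_distance(clean, noisy)


def cd_vat_loss(weights, x, epsilon=0.5, probe_scale=0.1,
                power_iters=1, h=1e-5, rng=None):
    """Return (loss, r_adv) for one unlabelled input x.

    No speaker label is used: the loss only measures how far the
    embedding moves under an adversarially chosen input perturbation.
    """
    rng = rng or random.Random(0)
    # Start from a random unit direction, as in standard VAT.
    d = unit([rng.gauss(0.0, 1.0) for _ in x])
    for _ in range(power_iters):
        r = [probe_scale * c for c in d]
        grad = []
        # Central finite differences replace backpropagation here.
        for i in range(len(x)):
            r_plus, r_minus = list(r), list(r)
            r_plus[i] += h
            r_minus[i] -= h
            diff = (perturbed_distance(weights, x, r_plus)
                    - perturbed_distance(weights, x, r_minus))
            grad.append(diff / (2.0 * h))
        d = unit(grad)
    # Adversarial perturbation with L2 norm epsilon.
    r_adv = [epsilon * c for c in d]
    return perturbed_distance(weights, x, r_adv), r_adv


if __name__ == "__main__":
    rng = random.Random(1)
    weights = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(3)]
    x = [rng.gauss(0.0, 1.0) for _ in range(4)]
    loss, r_adv = cd_vat_loss(weights, x)
    print("CD-VAT loss on one unlabelled example:", loss)
```

In a semi-supervised setup of the kind the abstract describes, a term like this would presumably be computed on unlabelled utterances and added, with some weight, to the supervised loss on labelled utterances. Because it compares two embeddings of the same input rather than class posteriors, it does not require the unlabelled speakers to belong to the labelled speaker set.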
