Augmentation adversarial training for self-supervised speaker recognition

Paper title

Augmentation adversarial training for self-supervised speaker recognition

Authors

Jaesung Huh, Hee Soo Heo, Jingu Kang, Shinji Watanabe, Joon Son Chung

Abstract

The goal of this work is to train robust speaker recognition models without speaker labels. Recent works on unsupervised speaker representations are based on contrastive learning, which encourages within-utterance embeddings to be similar and across-utterance embeddings to be dissimilar. However, since the within-utterance segments share the same acoustic characteristics, it is difficult to separate the speaker information from the channel information. To this end, we propose an augmentation adversarial training strategy that trains the network to be discriminative for the speaker information, while invariant to the augmentation applied. Since the augmentation simulates the acoustic characteristics, training the network to be invariant to augmentation also encourages the network to be invariant to the channel information in general. Extensive experiments on the VoxCeleb and VOiCES datasets show significant improvements over previous works using self-supervision, and the performance of our self-supervised models far exceeds that of humans.
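
The two ideas in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the loss formulas, so the softmax cross-entropy over cosine similarities used here is one common form of contrastive loss, and the function names, the `temperature` value, and the `weight` on the adversarial term are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(first, second, temperature=0.1):
    """Mean cross-entropy in which first[i] should match second[i].

    first[i] and second[i] are embeddings of two segments taken from
    utterance i (the positive pair). Every second[j] with j != i comes
    from a different utterance and acts as a negative.
    The temperature is an assumed hyperparameter.
    """
    total = 0.0
    for i, anchor in enumerate(first):
        logits = [cosine(anchor, other) / temperature for other in second]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_norm - logits[i]
    return total / len(first)


def embedding_objective(speaker_loss, augmentation_loss, weight=1.0):
    """Assumed objective for the embedding network in an adversarial setup.

    A separate classifier tries to recover the applied augmentation from
    the embeddings by minimising augmentation_loss. The embedding network
    is rewarded when that classifier fails, hence the minus sign.
    """
    return speaker_loss - weight * augmentation_loss
```

With toy two-dimensional embeddings, `contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])` is close to zero because each segment is most similar to its own utterance, while swapping the order of the second list makes the loss large. In a real system the minus sign in `embedding_objective` is often realised with a gradient reversal layer; whether the paper does so is not stated in the abstract.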
