Paper Title

Non-Contrastive Self-supervised Learning for Utterance-Level Information Extraction from Speech

Authors

Jaejin Cho, Jesús Villalba, Laureano Moro-Velazquez, Najim Dehak

Abstract

In recent studies, self-supervised pre-trained models tend to outperform supervised pre-trained models in transfer learning. In particular, self-supervised learning (SSL) of utterance-level speech representations can be used in speech applications that require a discriminative representation of attributes that stay consistent within an utterance: speaker, language, emotion, and age. Existing frame-level self-supervised speech representations, e.g., wav2vec, can be used as utterance-level representations with pooling, but the models are usually large. There are also SSL techniques that learn utterance-level representations directly. One of the most successful is a contrastive method, which requires negative sampling: selecting alternative samples to contrast with the current sample (the anchor). However, without labels, this cannot ensure that all negative samples belong to classes different from the anchor's class. This paper applies a non-contrastive self-supervised method to learn utterance-level embeddings. We adapted DIstillation with NO labels (DINO) from computer vision to speech. Unlike contrastive methods, DINO does not require negative sampling. We compared DINO to an x-vector trained in a supervised manner. When transferred to downstream tasks (speaker verification, speech emotion recognition (SER), and Alzheimer's disease detection), DINO outperformed x-vector. We studied the influence of several aspects of transfer learning, such as dividing the fine-tuning process into steps, chunk length, and augmentation. During fine-tuning, tuning the last affine layers first and then the whole network surpassed fine-tuning everything at once. Using shorter chunk lengths, although they generate more diverse inputs, did not necessarily improve performance, implying that each application requires speech segments of at least a certain length for better performance. Augmentation was helpful in SER.
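
To make the claim that DINO needs no negative sampling concrete, here is a minimal pure-Python sketch of a DINO-style objective for one pair of views of the same utterance. It is not the authors' implementation: the function names, the temperatures (0.1 and 0.04), and the toy 3-dimensional outputs are illustrative assumptions, and the general form (a centered, sharpened teacher distribution used as the target for the student, plus moving-average updates) follows the original DINO formulation from computer vision rather than anything stated in this abstract.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between teacher and student output distributions.

    Both inputs come from two augmented views of the SAME utterance, so no
    negative samples are involved. Temperatures are assumed values.
    """
    # Teacher: subtract the running center, then sharpen with a low temperature.
    centered = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = [math.exp(lp) for lp in log_softmax(centered, teacher_temp)]
    # Student: plain temperature-scaled log-probabilities.
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(p * lp for p, lp in zip(teacher_probs, student_log_probs))


def ema_update(old, new, momentum):
    """Exponential moving average, usable for the teacher weights or the center."""
    return [momentum * o + (1.0 - momentum) * n for o, n in zip(old, new)]


# Toy example: a student that agrees with the teacher gets a lower loss.
teacher_out = [3.0, 0.0, 0.0]
center = [0.0, 0.0, 0.0]
loss_agree = dino_loss([2.0, 0.0, 0.0], teacher_out, center)
loss_disagree = dino_loss([0.0, 2.0, 0.0], teacher_out, center)
print(loss_agree < loss_disagree)
```

In this formulation, collapse is countered by the centering and sharpening of the teacher output instead of by contrasting against other utterances, which is why the unlabeled negative-sampling problem described above does not arise.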
