ContentVec: An Improved Self-Supervised Speech Representation by Disentangling Speakers

Paper Title

ContentVec: An Improved Self-Supervised Speech Representation by Disentangling Speakers

Authors

Kaizhi Qian, Yang Zhang, Heting Gao, Junrui Ni, Cheng-I Lai, David Cox, Mark Hasegawa-Johnson, Shiyu Chang

Abstract

Self-supervised learning in speech involves training a speech representation network on a large-scale unannotated speech corpus, and then applying the learned representations to downstream tasks. Since the majority of the downstream tasks of SSL learning in speech largely focus on the content information in speech, the most desirable speech representations should be able to disentangle unwanted variations, such as speaker variations, from the content. However, disentangling speakers is very challenging, because removing the speaker information could easily result in a loss of content as well, and the damage of the latter usually far outweighs the benefit of the former. In this paper, we propose a new SSL method that can achieve speaker disentanglement without severe loss of content. Our approach is adapted from the HuBERT framework, and incorporates disentangling mechanisms to regularize both the teacher labels and the learned representations. We evaluate the benefit of speaker disentanglement on a set of content-related downstream tasks, and observe a consistent and notable performance advantage of our speaker-disentangled representations.
