Paper Title

Unsupervised Fine-Tuning Data Selection for ASR Using Self-Supervised Speech Models

Paper Authors

Reem Gody, David Harwath

Paper Abstract


Self-supervised learning (SSL) has been able to leverage unlabeled data to boost the performance of automatic speech recognition (ASR) models when we have access to only a small amount of transcribed speech data. However, this raises the question of which subset of the available unlabeled data should be selected for transcription. Our work investigates different unsupervised data selection techniques for fine-tuning the HuBERT model under a limited transcription budget. We investigate the impact of speaker diversity, gender bias, and topic diversity on downstream ASR performance. We also devise two novel techniques for unsupervised data selection: pre-training loss based data selection, and the perplexity of byte pair encoded clustered units (PBPE), and we show how these techniques compare to pure random data selection. Finally, we analyze the correlations among the inherent characteristics of the selected fine-tuning subsets, as well as how these characteristics correlate with the resultant word error rate (WER). We demonstrate the importance of token diversity, speaker diversity, and topic diversity in achieving the best performance in terms of WER.
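To make the PBPE idea concrete, here is a minimal sketch of perplexity-based selection. It is not the authors' implementation: it assumes each utterance has already been reduced to a discrete unit sequence (e.g., k-means clusters of HuBERT features) and BPE-encoded, it substitutes a simple unigram language model for whatever model the paper uses, and the function names (`bpe_perplexities`, `select_by_perplexity`) are hypothetical.

```python
# Illustrative sketch of PBPE-style data selection (not the paper's code).
# Assumptions: `bpe_sequences` holds each utterance's BPE-encoded discrete
# units (e.g., from k-means clustering of HuBERT features), and a unigram
# language model stands in for whatever model the authors actually used.
import math
from collections import Counter

def bpe_perplexities(bpe_sequences):
    """Per-utterance perplexity under a unigram LM fit on all BPE tokens."""
    counts = Counter(tok for seq in bpe_sequences for tok in seq)
    total = sum(counts.values())
    ppls = []
    for seq in bpe_sequences:
        nll = -sum(math.log(counts[tok] / total) for tok in seq) / len(seq)
        ppls.append(math.exp(nll))
    return ppls

def select_by_perplexity(utt_ids, durations_sec, bpe_sequences, budget_hours):
    """Keep the highest-perplexity utterances until the budget is spent."""
    ranked = sorted(
        zip(bpe_perplexities(bpe_sequences), utt_ids, durations_sec),
        reverse=True,
    )
    selected, hours = [], 0.0
    for _ppl, utt_id, dur in ranked:
        if hours + dur / 3600.0 > budget_hours:
            continue  # skip utterances that would overshoot the budget
        selected.append(utt_id)
        hours += dur / 3600.0
    return selected
```

Whether low- or high-perplexity utterances are preferable is itself an experimental question; the abstract only states that PBPE is compared against pure random selection, so the ranking direction above is a placeholder assumption.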
