Paper Title


Censer: Curriculum Semi-supervised Learning for Speech Recognition Based on Self-supervised Pre-training

Authors

Bowen Zhang, Songjun Cao, Xiaoming Zhang, Yike Zhang, Long Ma, Takahiro Shinozaki

Abstract


Recent studies have shown that the benefits provided by self-supervised pre-training and self-training (pseudo-labeling) are complementary. Semi-supervised fine-tuning strategies under the pre-training framework, however, remain insufficiently studied. Moreover, modern semi-supervised speech recognition algorithms either treat unlabeled data indiscriminately or filter out noisy samples with a confidence threshold; the dissimilarities among different unlabeled data are often ignored. In this paper, we propose Censer, a semi-supervised speech recognition algorithm based on self-supervised pre-training, to maximize the utilization of unlabeled data. The pre-training stage of Censer adopts wav2vec2.0, and the fine-tuning stage employs an improved semi-supervised learning algorithm derived from slimIPL, which leverages unlabeled data progressively according to the quality of their pseudo labels. We also incorporate a temporal pseudo label pool and an exponential moving average to control the pseudo labels' update frequency and to avoid model divergence. Experimental results on the Libri-Light and LibriSpeech datasets show that our proposed method achieves better performance than existing approaches while being more unified.
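The abstract mentions two stabilizing ingredients during semi-supervised fine-tuning: an exponential moving average (EMA) of model weights and a temporal pseudo label pool that throttles how often pseudo labels are regenerated. The following is a minimal Python/PyTorch sketch of these two ideas only, not the paper's actual implementation; the names ema_update, PseudoLabelPool, decode_fn, and all hyperparameter values are hypothetical placeholders introduced for illustration.

```python
from collections import deque

import torch


def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, decay: float = 0.999) -> None:
    """Update the teacher as an exponential moving average of the student's weights."""
    with torch.no_grad():
        for t_param, s_param in zip(teacher.parameters(), student.parameters()):
            # teacher <- decay * teacher + (1 - decay) * student
            t_param.mul_(decay).add_(s_param, alpha=1.0 - decay)


class PseudoLabelPool:
    """Fixed-size FIFO pool of (utterance_id, pseudo_label) pairs.

    Pseudo labels are regenerated only every `refresh_every` steps, so the
    targets used for the unsupervised loss change slowly over time.
    """

    def __init__(self, max_size: int, refresh_every: int):
        self.pool = deque(maxlen=max_size)  # oldest entries are evicted automatically
        self.refresh_every = refresh_every
        self.step = 0

    def maybe_refresh(self, teacher, unlabeled_batch, decode_fn):
        """Periodically decode a batch of unlabeled audio with the EMA teacher.

        `decode_fn` is a hypothetical callback that runs inference and returns
        one transcript per utterance in the batch.
        """
        self.step += 1
        if self.step % self.refresh_every == 0:
            with torch.no_grad():
                labels = decode_fn(teacher, unlabeled_batch)
            self.pool.extend(zip(unlabeled_batch["ids"], labels))

    def sample(self):
        """Return the current pool contents for building the unsupervised loss."""
        return list(self.pool)
```

In a training loop, one would typically call ema_update after each optimizer step and draw unsupervised targets from the pool rather than from the latest student outputs, which is the rough mechanism the abstract attributes to avoiding model divergence.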
