Pseudo Label Is Better Than Human Label

Paper Title

Pseudo Label Is Better Than Human Label

Authors

Dongseong Hwang, Khe Chai Sim, Zhouyuan Huo, Trevor Strohman

Abstract

State-of-the-art automatic speech recognition (ASR) systems are trained with tens of thousands of hours of labeled speech data. Human transcription is expensive and time-consuming. Factors such as the quality and consistency of the transcription can greatly affect the performance of the ASR models trained with these data. In this paper, we show that we can train a strong teacher model to produce high-quality pseudo labels by utilizing recent self-supervised and semi-supervised learning techniques. Specifically, we use JUST (Joint Unsupervised/Supervised Training) and iterative noisy student-teacher training to train a 600-million-parameter bi-directional teacher model. This model achieved a 4.0% word error rate (WER) on a voice search task, 11.1% relatively better than a baseline. We further show that by using this strong teacher model to generate high-quality pseudo labels for training, we can achieve a 13.6% relative WER reduction (5.9% to 5.1%) for a streaming model compared to using human labels.
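
The relative WER reduction quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name `relative_wer_reduction` is hypothetical and not taken from the paper, and it assumes the usual definition of relative reduction as the absolute drop divided by the baseline WER.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction as a fraction of the baseline WER.

    Assumes the common definition (baseline - new) / baseline; the name
    and signature are illustrative, not taken from the paper.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Streaming-model figures quoted in the abstract (WER in percent).
human_label_wer = 5.9   # trained on human labels
pseudo_label_wer = 5.1  # trained on teacher-generated pseudo labels

reduction = relative_wer_reduction(human_label_wer, pseudo_label_wer)
print(f"Relative WER reduction: {reduction:.1%}")  # prints "Relative WER reduction: 13.6%"
```

Under that definition, the 0.8-point absolute drop from 5.9% to 5.1% corresponds to the 13.6% relative figure stated in the abstract.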
