Large scale weakly and semi-supervised learning for low-resource video ASR

Paper title

Large scale weakly and semi-supervised learning for low-resource video ASR

Authors

Kritika Singh, Vimal Manohar, Alex Xiao, Sergey Edunov, Ross Girshick, Vitaliy Liptchinsky, Christian Fuegen, Yatharth Saraf, Geoffrey Zweig, Abdelrahman Mohamed

Abstract

Many semi- and weakly-supervised approaches have been investigated for overcoming the labeling cost of building high quality speech recognition systems. On the challenging task of transcribing social media videos in low-resource conditions, we conduct a large scale systematic comparison between two self-labeling methods on one hand, and weakly-supervised pretraining using contextual metadata on the other. We investigate distillation methods at the frame level and the sequence level for hybrid, encoder-only CTC-based, and encoder-decoder speech recognition systems on Dutch and Romanian languages using 27,000 and 58,000 hours of unlabeled audio respectively. Although all approaches improved upon their respective baseline WERs by more than 8%, sequence-level distillation for encoder-decoder models provided the largest relative WER reduction of 20% compared to the strongest data-augmented supervised baseline.
