Paper Title
Speech Recognition and Multi-Speaker Diarization of Long Conversations
Paper Authors
Paper Abstract
Speech recognition (ASR) and speaker diarization (SD) models have traditionally been trained separately to produce rich conversation transcripts with speaker labels. Recent advances have shown that joint ASR and SD models can learn to leverage audio-lexical inter-dependencies to improve word diarization performance. We introduce a new benchmark of hour-long podcasts collected from the weekly This American Life radio program to better compare these approaches when they are applied to extended multi-speaker conversations. We find that separately trained ASR and SD models perform better when utterance boundaries are known, but that joint models can otherwise perform better. To handle long conversations with unknown utterance boundaries, we introduce a striding attention decoding algorithm and data augmentation techniques which, combined with model pre-training, improve both ASR and SD.
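The abstract does not spell out how the striding attention decoding algorithm works, so as a rough intuition only: decoding hour-long audio with full attention is impractical, and one common remedy is to decode over overlapping windows that stride across the recording so each decoding step attends to a bounded chunk. The sketch below illustrates that general windowing idea; the function name, window size, and stride are hypothetical and are not taken from the paper.

```python
# Hypothetical illustration of strided windowing over a long recording.
# This is NOT the paper's algorithm, only the generic idea of covering
# long audio with overlapping, bounded-size chunks for decoding.

def striding_windows(n_frames, window=3000, stride=2000):
    """Yield (start, end) frame ranges covering n_frames with overlap,
    so attention is computed over at most `window` frames at a time."""
    start = 0
    while start < n_frames:
        end = min(start + window, n_frames)
        yield start, end
        if end == n_frames:
            break
        start += stride

# Example: an hour of audio at 100 frames/sec is 360,000 frames;
# consecutive windows overlap by (window - stride) = 1000 frames,
# giving the decoder shared context across chunk boundaries.
windows = list(striding_windows(360_000))
```

Each chunk would be decoded independently (or with carried-over state), and the overlapping region gives the model context to stitch hypotheses consistently across boundaries.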