Paper title
Leveraging Real Conversational Data for Multi-Channel Continuous Speech Separation
Authors
Abstract
Existing multi-channel continuous speech separation (CSS) models depend heavily on supervised data: either simulated data, which causes a mismatch between training and real-data testing, or real transcribed overlapping data, which is difficult to acquire. This dependence hinders further improvement in conversational/meeting transcription tasks. In this paper, we propose a three-stage training scheme for the CSS model that can leverage both supervised data and additional large-scale unsupervised real-world conversational data. The scheme consists of two conventional training approaches (pre-training using simulated data and ASR-loss-based training using transcribed data) and a novel continuous semi-supervised training stage between the two, in which the CSS model is further trained on real data using the teacher-student learning framework. We apply this scheme to an array-geometry-agnostic CSS model, which can use multi-channel data collected from any microphone array. Large-scale meeting transcription experiments are carried out on both Microsoft internal meeting data and the AMI meeting corpus. Steady improvement is observed at each training stage, showing the effectiveness of the proposed method in leveraging real conversational data for CSS model training.
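The ordering of the three stages can be illustrated with a toy sketch. This is not the paper's CSS model. It assumes a tiny linear regressor in pure Python, a squared-error loss standing in for both the separation loss and the ASR loss, and a noisy-student style input perturbation that the abstract does not specify. All function names and data values are hypothetical and chosen only for illustration.

```python
import random


def predict(weights, features):
    """Linear toy 'separator': a weighted sum of the input features."""
    return sum(w * x for w, x in zip(weights, features))


def sgd_step(weights, features, target, lr):
    """One squared-error gradient step; returns a new weight list."""
    error = predict(weights, features) - target
    return [w - lr * error * x for w, x in zip(weights, features)]


def supervised_stage(weights, labeled, lr=0.1, epochs=50):
    """Stages 1 and 3: fit the model to (features, target) pairs."""
    for _ in range(epochs):
        for features, target in labeled:
            weights = sgd_step(weights, features, target, lr)
    return weights


def teacher_student_stage(weights, unlabeled, lr=0.05, epochs=20,
                          noise=0.1, seed=0):
    """Stage 2: train a student on unlabeled inputs using teacher outputs."""
    rng = random.Random(seed)
    teacher = list(weights)   # frozen copy of the pre-trained model
    student = list(weights)   # student starts from the same weights
    for _ in range(epochs):
        for features in unlabeled:
            pseudo_target = predict(teacher, features)  # teacher sees clean input
            # Assumption: the student sees a perturbed input (noisy-student style).
            noisy = [x + rng.gauss(0.0, noise) for x in features]
            student = sgd_step(student, noisy, pseudo_target, lr)
    return student


def mse(weights, labeled):
    """Mean squared error of the model on (features, target) pairs."""
    return sum((predict(weights, f) - t) ** 2 for f, t in labeled) / len(labeled)


# Made-up toy data: simulated targets follow 2*x0 - x1, while the "real"
# transcribed targets follow 2.2*x0 - 0.9*x1 to mimic a data mismatch.
simulated = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
real_unlabeled = [[1.0, 0.5], [0.5, 1.0], [0.8, 0.2]]
real_transcribed = [([1.0, 0.0], 2.2), ([0.0, 1.0], -0.9)]

stage1 = supervised_stage([0.0, 0.0], simulated)        # pre-training
stage2 = teacher_student_stage(stage1, real_unlabeled)  # semi-supervised
stage3 = supervised_stage(stage2, real_transcribed)     # supervised fine-tuning

print("stage 1 error on real data:", mse(stage1, real_transcribed))
print("stage 3 error on real data:", mse(stage3, real_transcribed))
```

In this toy, the model after stage 3 fits the transcribed set better than the model after stage 1. That only reflects how the example data were constructed and says nothing about the size of the gains reported in the paper.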