Paper title
Endpoint Detection for Streaming End-to-End Multi-talker ASR
Paper authors
Paper abstract
Streaming end-to-end multi-talker speech recognition aims to transcribe overlapped speech from conversations or meetings with an all-neural model in a streaming fashion, which is fundamentally different from a modular approach that usually cascades speech separation and speech recognition models trained independently. Previously, we proposed the Streaming Unmixing and Recognition Transducer (SURT) model, based on the recurrent neural network transducer (RNN-T), for this problem and presented promising results. However, for real applications, the speech recognition system is also required to determine the timestamp at which a speaker finishes speaking, so that the system can respond promptly. This problem, known as endpoint (EP) detection, has not previously been studied for multi-talker end-to-end models. In this work, we address the EP detection problem in the SURT framework by introducing an end-of-sentence token as an output unit, following the practice of single-talker end-to-end models. Furthermore, we present a latency penalty approach that can significantly cut down the EP detection latency. Our experimental results on the 2-speaker LibrispeechMix dataset show that the SURT model can achieve promising EP detection without significant degradation of the recognition accuracy.
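To make the idea of treating the endpoint as an output unit concrete, the following is a minimal Python sketch and not the SURT implementation. It assumes a hypothetical end-of-sentence symbol `</s>`, a hypothetical frame shift of 40 ms, and a decoder that reports each emitted token together with the frame index at which it was emitted. The function names are illustrative only. In a multi-talker setting the same logic would presumably be applied to each output stream separately.

```python
from typing import List, Optional, Tuple

EOS = "</s>"  # hypothetical end-of-sentence symbol (assumption, not from the paper)


def append_eos(transcript: List[str]) -> List[str]:
    """Build a training target by appending EOS as the final output unit."""
    return transcript + [EOS]


def detect_endpoint(
    emissions: List[Tuple[int, str]],
    frame_shift_s: float = 0.04,  # assumed frame shift in seconds
) -> Optional[float]:
    """Return the time (in seconds) at which EOS is first emitted.

    `emissions` holds (frame_index, token) pairs in emission order.
    Returns None if the decoder never emitted EOS.
    """
    for frame_index, token in emissions:
        if token == EOS:
            return frame_index * frame_shift_s
    return None


def endpoint_latency(detected_s: float, reference_end_s: float) -> float:
    """EP latency: detected endpoint time minus the true end-of-speech time."""
    return detected_s - reference_end_s
```

Under these assumptions, the EP latency is the gap between the frame at which the model emits the end-of-sentence token and the time at which the speaker actually stopped. This gap is the quantity a latency penalty would aim to reduce.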