High Performance Sequence-to-Sequence Model for Streaming Speech Recognition

Paper Title

High Performance Sequence-to-Sequence Model for Streaming Speech Recognition

Authors

Nguyen, Thai-Son; Pham, Ngoc-Quan; Stueker, Sebastian; Waibel, Alex

Abstract

Recently, sequence-to-sequence models have started to achieve state-of-the-art performance on standard speech recognition tasks when processing audio data in batch mode, i.e., when the complete audio data is available at the start of processing. However, when it comes to performing run-on recognition on an input stream of audio data while producing recognition results in real time and with low word-based latency, these models face several challenges. Many techniques, e.g., the attention mechanism or the bidirectional LSTM (BLSTM), require the whole audio sequence to be decoded to be available at the start of processing. In this paper, we propose several techniques to mitigate these problems. We introduce an additional loss function controlling the uncertainty of the attention mechanism, a modified beam search identifying partial, stable hypotheses, ways of working with BLSTM in the encoder, and the use of chunked BLSTM. Our experiments show that with the right combination of these techniques, it is possible to perform run-on speech recognition with low word-based latency without sacrificing word error rate performance.
