Paper title
Multi-blank Transducers for Speech Recognition
Paper authors
Paper abstract
This paper proposes a modification to RNN-Transducer (RNN-T) models for automatic speech recognition (ASR). In standard RNN-T, the emission of a blank symbol consumes exactly one input frame; in our proposed method, we introduce additional blank symbols, which consume two or more input frames when emitted. We refer to the added symbols as big blanks, and to the method as multi-blank RNN-T. For training multi-blank RNN-Ts, we propose a novel logit under-normalization method in order to prioritize emissions of big blanks. With experiments on multiple languages and datasets, we show that multi-blank RNN-T methods can bring relative speedups of over +90% and +139% to model inference on the English Librispeech and German Multilingual Librispeech datasets, respectively. The multi-blank RNN-T method also improves ASR accuracy consistently. We will release our implementation of the method in the NeMo (https://github.com/NVIDIA/NeMo) toolkit.
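The frame-skipping idea in the abstract can be illustrated with a small greedy decoding loop. This is a minimal sketch under stated assumptions, not the NeMo implementation: `greedy_decode`, `make_toy_step`, the token ids, and the blank durations `{0: 1, 1: 2, 2: 4}` are all hypothetical, and the toy step function stands in for a trained prediction/joint network. It only shows why emitting a big blank reduces the number of decoding steps.

```python
from typing import Callable, Dict, List, Tuple


def greedy_decode(
    num_frames: int,
    step_fn: Callable[[int, List[int]], int],
    blank_durations: Dict[int, int],
    max_symbols_per_frame: int = 10,
) -> Tuple[List[int], int]:
    """Greedy transducer-style decoding with several blank symbols.

    blank_durations maps each blank id to the number of input frames it
    consumes (a standard RNN-T has a single blank with duration 1).
    Returns the emitted tokens and the number of step_fn calls.
    """
    tokens: List[int] = []
    calls = 0
    t = 0
    symbols_at_t = 0
    while t < num_frames:
        symbol = step_fn(t, tokens)  # stand-in for one joint-network evaluation
        calls += 1
        if symbol in blank_durations:
            # A blank advances time; a big blank skips several frames at once.
            t += blank_durations[symbol]
            symbols_at_t = 0
        else:
            tokens.append(symbol)
            symbols_at_t += 1
            if symbols_at_t >= max_symbols_per_frame:
                # Safety guard against emitting forever on one frame.
                t += 1
                symbols_at_t = 0
    return tokens, calls


def make_toy_step(
    token_at_frame: Dict[int, int],
    blank_durations: Dict[int, int],
    num_frames: int,
) -> Callable[[int, List[int]], int]:
    """Hypothetical scripted 'model': emits a token at fixed frames and
    otherwise picks the longest blank that does not skip past the next token."""
    emitted_frames = set()

    def step(t: int, tokens: List[int]) -> int:
        if t in token_at_frame and t not in emitted_frames:
            emitted_frames.add(t)
            return token_at_frame[t]
        next_stop = min([f for f in token_at_frame if f > t] + [num_frames])
        gap = next_stop - t
        # Pick the blank with the largest duration that still fits in the gap.
        _, blank_id = max((d, b) for b, d in blank_durations.items() if d <= gap)
        return blank_id

    return step


NUM_FRAMES = 8
TOKEN_AT_FRAME = {0: 10, 5: 11}      # assumed toy alignment
STANDARD = {0: 1}                    # one blank, one frame
MULTI_BLANK = {0: 1, 1: 2, 2: 4}     # assumed big-blank durations

std_tokens, std_calls = greedy_decode(
    NUM_FRAMES, make_toy_step(TOKEN_AT_FRAME, STANDARD, NUM_FRAMES), STANDARD
)
mb_tokens, mb_calls = greedy_decode(
    NUM_FRAMES, make_toy_step(TOKEN_AT_FRAME, MULTI_BLANK, NUM_FRAMES), MULTI_BLANK
)
print(std_tokens, std_calls)
print(mb_tokens, mb_calls)
```

In this toy run both configurations emit the same tokens, but the multi-blank configuration needs fewer step evaluations because long stretches without output are covered by a single big blank. The actual speedups reported in the abstract depend on the trained model and data, which this sketch does not attempt to reproduce.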