Multi-blank Transducers for Speech Recognition

Paper title

Multi-blank Transducers for Speech Recognition

Authors

Hainan Xu, Fei Jia, Somshubra Majumdar, Shinji Watanabe, Boris Ginsburg

Abstract

This paper proposes a modification to RNN-Transducer (RNN-T) models for automatic speech recognition (ASR). In standard RNN-T, the emission of a blank symbol consumes exactly one input frame; in our proposed method, we introduce additional blank symbols, which consume two or more input frames when emitted. We refer to the added symbols as big blanks, and the method multi-blank RNN-T. For training multi-blank RNN-Ts, we propose a novel logit under-normalization method in order to prioritize emissions of big blanks. With experiments on multiple languages and datasets, we show that multi-blank RNN-T methods could bring relative speedups of over +90%/+139% to model inference for English Librispeech and German Multilingual Librispeech datasets, respectively. The multi-blank RNN-T method also improves ASR accuracy consistently. We will release our implementation of the method in the NeMo (https://github.com/NVIDIA/NeMo) toolkit.
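The frame-skipping idea in the abstract can be illustrated with a minimal greedy-decoding sketch. This is not the NeMo implementation: the token ids, the blank durations (1, 2 and 4 frames), the `step_fn` interface, and the scripted toy model are all assumptions made for illustration. The only behavior taken from the abstract is that a standard blank advances the decoder by one input frame, while a big blank advances it by two or more.

```python
from typing import Callable, Dict, List, Tuple


def greedy_decode_multi_blank(
    num_frames: int,
    step_fn: Callable[[int, List[int]], int],
    blank_durations: Dict[int, int],
    max_symbols_per_frame: int = 10,
) -> Tuple[List[int], int]:
    """Greedy transducer decoding where each blank id has a frame duration.

    step_fn(t, tokens) stands in for the encoder/predictor/joint networks and
    returns the id of the most likely symbol (hypothetical interface).
    Returns the emitted label ids and the number of step_fn calls.
    """
    tokens: List[int] = []
    t = 0
    calls = 0
    emitted_at_t = 0
    while t < num_frames:
        symbol = step_fn(t, tokens)
        calls += 1
        if symbol in blank_durations:
            # Standard blank: duration 1. Big blank: duration >= 2.
            t += blank_durations[symbol]
            emitted_at_t = 0
        else:
            tokens.append(symbol)
            emitted_at_t += 1
            if emitted_at_t >= max_symbols_per_frame:
                # Safety guard against emitting labels forever on one frame.
                t += 1
                emitted_at_t = 0
    return tokens, calls


def make_toy_step_fn(
    label_at_frame: Dict[int, int], num_frames: int, durations: Dict[int, int]
) -> Callable[[int, List[int]], int]:
    """Scripted stand-in for a trained model (assumption, for illustration).

    It emits one label at each listed frame and otherwise picks the longest
    available blank that does not skip past the next label or the end.
    """
    label_frames = sorted(label_at_frame)

    def step_fn(t: int, tokens: List[int]) -> int:
        pending = label_frames[len(tokens):]
        if pending and pending[0] == t:
            return label_at_frame[t]
        gap = (pending[0] if pending else num_frames) - t
        _, blank_id = max((d, s) for s, d in durations.items() if d <= gap)
        return blank_id

    return step_fn


NUM_FRAMES = 10
LABELS = {0: 10, 6: 11}                  # frame index -> label id (made up)
STANDARD = {0: 1}                        # blank id 0 consumes 1 frame
MULTI_BLANK = {0: 1, 1: 2, 2: 4}         # ids 1 and 2 are big blanks

std_tokens, std_calls = greedy_decode_multi_blank(
    NUM_FRAMES, make_toy_step_fn(LABELS, NUM_FRAMES, STANDARD), STANDARD
)
mb_tokens, mb_calls = greedy_decode_multi_blank(
    NUM_FRAMES, make_toy_step_fn(LABELS, NUM_FRAMES, MULTI_BLANK), MULTI_BLANK
)
print(std_tokens, std_calls)
print(mb_tokens, mb_calls)
```

In this toy setup both decoders produce the same two labels, but the multi-blank variant needs 5 model calls instead of 12 because long silent stretches are skipped in one step. The speedups reported in the abstract come from the authors' experiments on real models, not from this sketch.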
