Paper Title
Cascade RNN-Transducer: Syllable Based Streaming On-device Mandarin Speech Recognition with a Syllable-to-Character Converter
Paper Authors
Abstract
End-to-end models are favored in automatic speech recognition (ASR) because of their simplified system structure and superior performance. Among these models, the recurrent neural network transducer (RNN-T) has achieved significant progress in streaming on-device speech recognition because of its high accuracy and low latency. RNN-T adopts a prediction network to enhance language information, but its language modeling ability is limited because it still needs paired speech-text data for training. Further strengthening the language modeling ability with extra text data, for example through shallow fusion with an external language model, brings only a small performance gain. Given that Mandarin Chinese is a character-based language and each character is pronounced as a tonal syllable, this paper proposes a novel cascade RNN-T approach to improve the language modeling ability of RNN-T. The approach first uses an RNN-T to transform acoustic features into a syllable sequence, and then converts the syllable sequence into a character sequence through an RNN-T-based syllable-to-character converter. Because the converter maps syllables to characters without needing audio, a rich text repository can easily be used to strengthen its language modeling ability. By introducing several important tricks, the cascade RNN-T approach surpasses the character-based RNN-T by a large margin on several Mandarin test sets, with much higher recognition quality and similar latency.
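The two-stage structure described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: both stages in the paper are RNN-T models, whereas here the first stage is a caller-supplied function and the second is a small lookup with bigram scoring. The names `CANDIDATES`, `BIGRAM`, `syllables_to_characters`, and `cascade_recognize`, as well as all scores, are hypothetical and chosen only to show why converting homophonous tonal syllables to characters benefits from text-only knowledge.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical toy lexicon: each tonal syllable maps to several homophonous characters.
CANDIDATES: Dict[str, List[str]] = {
    "yu3": ["语", "雨"],
    "yin1": ["音", "因"],
}

# Hypothetical character-bigram scores. They stand in for knowledge that a
# converter could learn from text alone, without paired speech.
BIGRAM: Dict[Tuple[str, str], float] = {
    ("语", "音"): 2.0,
    ("雨", "音"): 0.5,
    ("语", "因"): 0.1,
}


def syllables_to_characters(syllables: Sequence[str]) -> str:
    """Stage 2 stand-in: pick the character sequence with the best bigram score.

    Exhaustive search is used for clarity; it is neither streaming nor an RNN-T.
    """
    options = [CANDIDATES[s] for s in syllables]

    def score(chars: Tuple[str, ...]) -> float:
        # Sum the scores of adjacent character pairs; unseen pairs score 0.
        return sum(BIGRAM.get(pair, 0.0) for pair in zip(chars, chars[1:]))

    return "".join(max(product(*options), key=score))


def cascade_recognize(
    features: object,
    acoustic_to_syllables: Callable[[object], Sequence[str]],
    converter: Callable[[Sequence[str]], str] = syllables_to_characters,
) -> str:
    """Compose stage 1 (acoustic features to syllables) with stage 2 (syllables to characters)."""
    return converter(acoustic_to_syllables(features))


if __name__ == "__main__":
    # Stage 1 is faked here: it ignores the features and returns fixed syllables.
    fake_stage1 = lambda feats: ["yu3", "yin1"]
    print(cascade_recognize(None, fake_stage1))
```

In this toy setup only the second stage depends on `BIGRAM`, so its scores could be re-estimated from additional text without touching the first stage, which mirrors the motivation stated in the abstract.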