Paper Title

Adding Connectionist Temporal Summarization into Conformer to Improve Its Decoder Efficiency For Speech Recognition

Paper Authors

Wang, Nick J. C., Quan, Zongfeng, Wang, Shaojun, Xiao, Jing

Paper Abstract

The Conformer model is an excellent architecture for speech recognition modeling that effectively utilizes the hybrid losses of connectionist temporal classification (CTC) and attention to train model parameters. To improve the decoding efficiency of Conformer, we propose a novel connectionist temporal summarization (CTS) method that reduces the number of frames from the encoder's acoustic sequence that are fed to the attention decoder, thus reducing operations. However, to achieve this decoding improvement, we must fine-tune the model parameters, because the cross-attention observations are changed and thus require corresponding refinement. Our final experiments show that, with a beam width of 4, the decoding budget can be reduced by up to 20% on LibriSpeech and by 11% on FluentSpeech data, without losing ASR accuracy. Accuracy even improves on the LibriSpeech "test-other" set: the word error rate (WER) is reduced by 6% relative at a beam width of 1 and by 3% relative at a beam width of 4.
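
The abstract does not spell out the frame-reduction mechanism, so the following is only a minimal NumPy sketch of one plausible reading: CTS collapses runs of consecutive encoder frames whose CTC blank posterior exceeds a threshold into a single averaged vector, shortening the sequence the attention decoder must attend over. The function name, the `blank_threshold` parameter, and the merging rule are assumptions for illustration, not the paper's exact algorithm.

```python
import numpy as np

def connectionist_temporal_summarization(encoder_frames, ctc_log_probs,
                                         blank_id=0, blank_threshold=0.9):
    """Hypothetical CTS sketch: average each run of frames that the CTC
    head labels as blank, keep non-blank frames intact, and return the
    shortened sequence for the attention decoder."""
    blank_prob = np.exp(ctc_log_probs[:, blank_id])  # per-frame blank posterior, shape (T,)
    summarized, blank_run = [], []
    for t, frame in enumerate(encoder_frames):
        if blank_prob[t] >= blank_threshold:
            blank_run.append(frame)                  # accumulate the current blank run
        else:
            if blank_run:                            # close the run with one summary vector
                summarized.append(np.mean(blank_run, axis=0))
                blank_run = []
            summarized.append(frame)                 # keep informative frames as-is
    if blank_run:                                    # flush a trailing blank run
        summarized.append(np.mean(blank_run, axis=0))
    return np.stack(summarized)

# Toy usage: 10 encoder frames of dim 4 with random CTC posteriors
rng = np.random.default_rng(0)
frames = rng.normal(size=(10, 4)).astype(np.float32)
log_probs = np.log(rng.dirichlet(np.ones(5), size=10))  # 5 CTC labels incl. blank
short = connectionist_temporal_summarization(frames, log_probs)
print(frames.shape, "->", short.shape)
```

Under this reading, the decoder's cross-attention cost scales with the summarized length rather than the full encoder length, which is consistent with the reported decoding-budget reductions; fine-tuning is then needed because cross-attention now observes averaged, shorter sequences.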
