Cascaded encoders for unifying streaming and non-streaming ASR

Paper title

Cascaded encoders for unifying streaming and non-streaming ASR

Authors

Arun Narayanan, Tara N. Sainath, Ruoming Pang, Jiahui Yu, Chung-Cheng Chiu, Rohit Prabhavalkar, Ehsan Variani, Trevor Strohman

Abstract

End-to-end (E2E) automatic speech recognition (ASR) models have by now shown competitive performance on several benchmarks. These models are structured to operate in either streaming or non-streaming mode. This work presents cascaded encoders for building a single E2E ASR model that can operate in both these modes simultaneously. The proposed model consists of streaming and non-streaming encoders. Input features are first processed by the streaming encoder; the non-streaming encoder operates exclusively on the output of the streaming encoder. A single decoder then learns to decode using the output of either the streaming or the non-streaming encoder. Results show that this model achieves word error rates (WER) similar to those of a standalone streaming model when operating in streaming mode, and obtains a 10% to 27% relative improvement when operating in non-streaming mode. Our results also show that the proposed approach outperforms existing E2E two-pass models, especially on long-form speech.
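The data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the moving-average "encoders", the argmax "decoder", and all function and parameter names (`streaming_encoder`, `non_streaming_encoder`, `shared_decoder`, `recognize`, `left_context`) are hypothetical stand-ins chosen only to show the cascade, namely that the non-streaming encoder consumes the streaming encoder's output and that one decoder serves both modes.

```python
from typing import List

Frame = List[float]


def streaming_encoder(features: List[Frame], left_context: int = 2) -> List[Frame]:
    """Causal stand-in: output frame t depends only on frames t-left_context..t."""
    out = []
    for t in range(len(features)):
        window = features[max(0, t - left_context): t + 1]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out


def non_streaming_encoder(stream_out: List[Frame]) -> List[Frame]:
    """Full-context stand-in: reads only the streaming encoder's output,
    and every output frame depends on the whole utterance."""
    if not stream_out:
        return []
    n = len(stream_out)
    global_mean = [sum(col) / n for col in zip(*stream_out)]
    return [[0.5 * x + 0.5 * g for x, g in zip(frame, global_mean)]
            for frame in stream_out]


def shared_decoder(encoded: List[Frame]) -> List[int]:
    """Single decoder stand-in: emits the index of the largest value per frame."""
    return [max(range(len(frame)), key=frame.__getitem__) for frame in encoded]


def recognize(features: List[Frame], mode: str) -> List[int]:
    """Run the cascade; the same decoder is used in both modes."""
    stream_out = streaming_encoder(features)
    if mode == "streaming":
        return shared_decoder(stream_out)
    if mode == "non_streaming":
        return shared_decoder(non_streaming_encoder(stream_out))
    raise ValueError(f"unknown mode: {mode}")
```

In this sketch, changing a later input frame leaves earlier streaming outputs untouched but shifts every non-streaming output, which mirrors the causal versus full-context distinction. The streaming encoder's computation is reused in non-streaming mode rather than repeated by a separate model.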
