Dynamic latency speech recognition with asynchronous revision

Paper title

Dynamic latency speech recognition with asynchronous revision

Authors

Huang, Mingkun; Cai, Meng; Zhang, Jun; Zhang, Yang; You, Yongbin; He, Yi; Ma, Zejun

Abstract

In this work we propose an inference technique, asynchronous revision, to unify streaming and non-streaming speech recognition models. Specifically, we achieve dynamic latency with a single model by using arbitrary right context during inference. The model is composed of a stack of convolutional layers for audio encoding. At inference time, the history states of the encoder and decoder can be asynchronously revised to trade off latency against accuracy. To alleviate the mismatch between training and inference, we propose a training technique, segment cropping, which randomly splits input utterances into several segments with forward connections. This allows us to obtain dynamic-latency speech recognition results with large improvements in accuracy. Experiments show that our dynamic latency model with asynchronous revision gives 8%-14% relative improvements over the streaming models.
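
The abstract gives no implementation details for segment cropping, so the following is only a minimal sketch of one possible reading: an utterance of `num_frames` frames is cut at random points into contiguous segments, and "forward connections" is interpreted (as an assumption) to mean that each frame may use every frame in its own segment and in all earlier segments, but none from later segments. The function names `segment_boundaries` and `visibility_mask` are hypothetical and do not come from the paper.

```python
import random


def segment_boundaries(num_frames, num_segments, rng):
    """Randomly split range(num_frames) into contiguous, non-empty segments.

    Returns a list of half-open (start, end) index pairs.
    """
    if not 1 <= num_segments <= num_frames:
        raise ValueError("need 1 <= num_segments <= num_frames")
    # Choose distinct interior cut points, then sort them into boundaries.
    cuts = sorted(rng.sample(range(1, num_frames), num_segments - 1))
    edges = [0] + cuts + [num_frames]
    return list(zip(edges[:-1], edges[1:]))


def visibility_mask(num_frames, segments):
    """Build mask[t][s]: True if frame t may use frame s.

    Assumption (not stated in the abstract): a frame sees its own segment
    and all earlier segments, and nothing from later segments.
    """
    mask = [[False] * num_frames for _ in range(num_frames)]
    for start, end in segments:
        for t in range(start, end):
            for s in range(end):
                mask[t][s] = True
    return mask


rng = random.Random(0)  # fixed seed so the split is reproducible
segments = segment_boundaries(10, 3, rng)
mask = visibility_mask(10, segments)
```

Under this reading, frames near the start of a segment get a longer right context than frames near its end, so one training pass exposes the model to many different amounts of right context.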

Paper title