Paper Title

Transformer with Bidirectional Decoder for Speech Recognition

Authors

Xi Chen, Songyang Zhang, Dandan Song, Peng Ouyang, Shouyi Yin

Abstract

Attention-based models have recently made tremendous progress on end-to-end automatic speech recognition (ASR). However, conventional transformer-based approaches usually generate the result sequence token by token from left to right, leaving the right-to-left contexts unexploited. In this work, we introduce a bidirectional speech transformer to exploit both directional contexts simultaneously. Specifically, the outputs of our proposed transformer include a left-to-right target and a right-to-left target. In the inference stage, we use the introduced bidirectional beam search method, which generates both left-to-right and right-to-left candidates and determines the best hypothesis by score. To demonstrate our proposed speech transformer with a bidirectional decoder (STBD), we conduct extensive experiments on the AISHELL-1 dataset. The experimental results show that STBD achieves a 3.6% relative CER reduction (CERR) over the unidirectional speech transformer baseline. Moreover, the strongest model in this paper, called STBD-Big, achieves 6.64% CER on the test set without language model rescoring or any extra data augmentation strategies.
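The bidirectional beam search described in the abstract can be sketched roughly as follows: run beam search once with each decoder and keep the hypothesis with the higher score, reversing the right-to-left result so both candidates read left to right. This is only a minimal illustration, not the paper's implementation; the per-step token distributions here are toy stand-ins for the decoder outputs (a real decoder would condition each step on the generated prefix), and all function names are hypothetical.

```python
import math


def beam_search(step_log_probs, beam_size):
    """Minimal beam search over a fixed number of steps.

    step_log_probs: one dict per decoding step mapping token -> log-prob.
    (Toy stand-in for a decoder; assumption: real scores depend on the prefix.)
    Returns the highest-scoring (sequence, total_log_prob) pair.
    """
    beams = [((), 0.0)]
    for dist in step_log_probs:
        candidates = []
        for seq, score in beams:
            for tok, lp in dist.items():
                candidates.append((seq + (tok,), score + lp))
        # Keep only the beam_size best partial hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams[0]


def bidirectional_beam_search(l2r_probs, r2l_probs, beam_size=2):
    """Decode with both directions and pick the better-scoring hypothesis.

    The right-to-left candidate is reversed so the returned sequence
    always reads left to right.
    """
    l2r_seq, l2r_score = beam_search(l2r_probs, beam_size)
    r2l_seq, r2l_score = beam_search(r2l_probs, beam_size)
    if l2r_score >= r2l_score:
        return list(l2r_seq), l2r_score
    return list(reversed(r2l_seq)), r2l_score


# Toy example: the R2L decoder happens to be more confident here,
# so its (reversed) hypothesis wins.
l2r = [{"a": math.log(0.9), "b": math.log(0.1)},
       {"c": math.log(0.6), "d": math.log(0.4)}]
r2l = [{"c": math.log(0.8), "d": math.log(0.2)},
       {"a": math.log(0.95), "b": math.log(0.05)}]
best_seq, best_score = bidirectional_beam_search(l2r, r2l)
```

In the actual model both targets come from a shared decoder trained jointly, but the selection-by-score step at inference time follows this shape.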
