Paper Title

Structured State Space Decoder for Speech Recognition and Synthesis

Authors

Koichi Miyazaki, Masato Murata, Tomoki Koriyama

Abstract

Automatic speech recognition (ASR) systems developed in recent years have shown promising results with self-attention models (e.g., Transformer and Conformer), which are replacing conventional recurrent neural networks. Meanwhile, the structured state space model (S4) has recently been proposed, producing promising results on various long-sequence modeling tasks, including raw speech classification. Like the Transformer, the S4 model can be trained in parallel. In this study, we applied S4 as a decoder for ASR and text-to-speech (TTS) tasks, comparing it with the Transformer decoder. For the ASR task, our experimental results demonstrate that the proposed model achieves a competitive word error rate (WER) of 1.88%/4.25% on the LibriSpeech test-clean/test-other sets and a character error rate (CER) of 3.80%/2.63%/2.98% on the CSJ eval1/eval2/eval3 sets. Furthermore, the proposed model is more robust than the standard Transformer model, particularly for long-form speech, on both datasets. For the TTS task, the proposed method outperforms the Transformer baseline.
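The abstract's claim that S4 can be trained in parallel rests on the fact that a linear state space recurrence unrolls into a 1-D convolution, so a whole sequence can be processed at once. Below is a minimal sketch of this equivalence with small random matrices; it is illustrative only and does not use the actual S4 (HiPPO-based) parameterization or its fast kernel computation.

```python
import numpy as np

# Linear SSM:  x_k = A x_{k-1} + B u_k,   y_k = C x_k
# Unrolling gives y_k = sum_j (C A^{k-j} B) u_j, i.e. a convolution
# with kernel K = (CB, CAB, CA^2B, ...) -- the basis of parallel training.
rng = np.random.default_rng(0)
N, L = 4, 16                        # state size, sequence length
A = 0.3 * rng.normal(size=(N, N))   # stable-ish transition (toy choice)
B = rng.normal(size=(N, 1))
C = rng.normal(size=(1, N))
u = rng.normal(size=L)              # input sequence

# 1) Sequential (recurrent) evaluation, one step at a time.
x = np.zeros((N, 1))
y_rec = []
for k in range(L):
    x = A @ x + B * u[k]
    y_rec.append(float(C @ x))

# 2) Parallel evaluation: precompute the convolution kernel once,
#    then each output is an inner product over the input prefix.
K = np.array([float(C @ np.linalg.matrix_power(A, k) @ B) for k in range(L)])
y_conv = [float(np.dot(K[: k + 1][::-1], u[: k + 1])) for k in range(L)]

assert np.allclose(y_rec, y_conv)   # both views compute the same outputs
```

The recurrent form is what makes S4 cheap at inference time, while the convolutional form is what makes training parallelizable, Transformer-style.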
