Paper Title

Structured State Space Decoder for Speech Recognition and Synthesis

Authors

Koichi Miyazaki, Masato Murata, Tomoki Koriyama

Abstract

Automatic speech recognition (ASR) systems developed in recent years have shown promising results with self-attention models (e.g., Transformer and Conformer), which are replacing conventional recurrent neural networks. Meanwhile, the structured state space model (S4) has recently been proposed, producing promising results on various long-sequence modeling tasks, including raw speech classification. Like the Transformer, the S4 model can be trained in parallel. In this study, we applied S4 as a decoder for ASR and text-to-speech (TTS) tasks, comparing it with the Transformer decoder. For the ASR task, our experimental results demonstrate that the proposed model achieves a competitive word error rate (WER) of 1.88%/4.25% on the LibriSpeech test-clean/test-other sets and a character error rate (CER) of 3.80%/2.63%/2.98% on the CSJ eval1/eval2/eval3 sets. Furthermore, the proposed model is more robust than the standard Transformer model, particularly for long-form speech, on both datasets. For the TTS task, the proposed method outperforms the Transformer baseline.
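The abstract's claim that S4 can be trained in parallel rests on the fact that a linear state space recurrence unrolls into a 1-D convolution, so a whole sequence can be processed at once. Below is a minimal sketch of this equivalence with small random matrices; it is illustrative only and does not use the actual S4 (HiPPO-based) parameterization or its fast kernel computation.

```python
import numpy as np

# Linear SSM:  x_k = A x_{k-1} + B u_k,   y_k = C x_k
# Unrolling gives y_k = sum_j (C A^{k-j} B) u_j, i.e. a convolution
# with kernel K = (CB, CAB, CA^2B, ...) -- the basis of parallel training.
rng = np.random.default_rng(0)
N, L = 4, 16                        # state size, sequence length
A = 0.3 * rng.normal(size=(N, N))   # stable-ish transition (toy choice)
B = rng.normal(size=(N, 1))
C = rng.normal(size=(1, N))
u = rng.normal(size=L)              # input sequence

# 1) Sequential (recurrent) evaluation, one step at a time.
x = np.zeros((N, 1))
y_rec = []
for k in range(L):
    x = A @ x + B * u[k]
    y_rec.append(float(C @ x))

# 2) Parallel evaluation: precompute the convolution kernel once,
#    then each output is an inner product over the input prefix.
K = np.array([float(C @ np.linalg.matrix_power(A, k) @ B) for k in range(L)])
y_conv = [float(np.dot(K[: k + 1][::-1], u[: k + 1])) for k in range(L)]

assert np.allclose(y_rec, y_conv)   # both views compute the same outputs
```

The recurrent form is what makes S4 cheap at inference time, while the convolutional form is what makes training parallelizable, Transformer-style.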
