Paper Title

MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech Recognition

Paper Authors

Xiaohuan Zhou, Jiaming Wang, Zeyu Cui, Shiliang Zhang, Zhijie Yan, Jingren Zhou, Chang Zhou

Paper Abstract

In this paper, we propose a novel multi-modal multi-task encoder-decoder pre-training framework (MMSpeech) for Mandarin automatic speech recognition (ASR), which employs both unlabeled speech and text data. The main difficulty in speech-text joint pre-training comes from the significant difference between speech and text modalities, especially for Mandarin speech and text. Unlike English and other languages with an alphabetic writing system, Mandarin uses an ideographic writing system where character and sound are not tightly mapped to one another. Therefore, we propose to introduce the phoneme modality into pre-training, which can help capture modality-invariant information between Mandarin speech and text. Specifically, we employ a multi-task learning framework including five self-supervised and supervised tasks with speech and text data. For end-to-end pre-training, we introduce self-supervised speech-to-pseudo-codes (S2C) and phoneme-to-text (P2T) tasks utilizing unlabeled speech and text data, where speech-pseudo-codes pairs and phoneme-text pairs are a supplement to the supervised speech-text pairs. To train the encoder to learn better speech representation, we introduce self-supervised masked speech prediction (MSP) and supervised phoneme prediction (PP) tasks to learn to map speech into phonemes. Besides, we directly add the downstream supervised speech-to-text (S2T) task into the pre-training process, which can further improve the pre-training performance and achieve better recognition results even without fine-tuning. Experiments on AISHELL-1 show that our proposed method achieves state-of-the-art performance, with a more than 40% relative improvement compared with other pre-training methods.
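
As a rough illustration of the five-task objective described above, the sketch below shows how per-task losses (S2C, P2T, MSP, PP, S2T) could be combined into a single multi-task training loss. This is a minimal sketch, not the authors' implementation; the loss values, the equal task weights, and the weighted-sum aggregation are illustrative assumptions.

```python
# Minimal sketch of a multi-task pre-training objective (assumed form,
# not the MMSpeech authors' code): a weighted sum of the five task losses.
import torch

def multitask_loss(losses, weights):
    """Weighted sum of per-task losses; unlisted tasks default to weight 1.0."""
    total = torch.zeros(())
    for task, loss in losses.items():
        total = total + weights.get(task, 1.0) * loss
    return total

# Dummy scalar losses standing in for the five tasks named in the abstract.
losses = {
    "S2C": torch.tensor(2.1),  # self-supervised speech-to-pseudo-codes
    "P2T": torch.tensor(1.7),  # self-supervised phoneme-to-text
    "MSP": torch.tensor(0.9),  # self-supervised masked speech prediction
    "PP":  torch.tensor(0.6),  # supervised phoneme prediction
    "S2T": torch.tensor(1.2),  # supervised speech-to-text
}
weights = {"S2C": 1.0, "P2T": 1.0, "MSP": 1.0, "PP": 1.0, "S2T": 1.0}  # assumed equal weighting
print(multitask_loss(losses, weights))  # scalar loss to backpropagate through
```

In an actual training loop, each batch would presumably mix unlabeled speech (S2C, MSP), unlabeled text (P2T), and labeled speech-text pairs (PP, S2T), with the encoder and decoder shared across tasks.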
