Paper Title

Knowledge Transfer from Large-scale Pretrained Language Models to End-to-end Speech Recognizers

Paper Authors

Yotaro Kubo, Shigeki Karita, Michiel Bacchiani

Paper Abstract

End-to-end speech recognition is a promising technology for building compact automatic speech recognition (ASR) systems, since it unifies the acoustic and language models into a single neural network. As a drawback, however, training an end-to-end speech recognizer always requires transcribed utterances. Since end-to-end models are also known to be severely data-hungry, this constraint is critical, especially because obtaining transcribed utterances is costly and can be impractical or even impossible. This paper proposes a method for alleviating this issue by transferring knowledge from a language model neural network that can be pretrained on text-only data. Specifically, the paper attempts to transfer the semantic knowledge captured in the embedding vectors of large-scale language models. Since embedding vectors can be regarded as implicit representations of linguistic information such as part-of-speech, intent, and so on, they are also expected to provide useful modeling cues for ASR decoders. The paper extends two types of ASR decoders, attention-based decoders and neural transducers, by modifying their training loss functions to include embedding prediction terms. The proposed systems were shown to be effective at reducing error rates without incurring extra computational cost in the decoding phase.
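As a rough illustration of the loss modification the abstract describes, the PyTorch sketch below augments an attention-based decoder's cross-entropy loss with an auxiliary term that regresses the decoder state onto the pretrained LM embedding of the reference token. This is a minimal sketch under assumptions not stated in the abstract: the names (`EmbeddingPredictionLoss`, `lambda_emb`), the L2 regression target, and the linear projection are illustrative choices, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EmbeddingPredictionLoss(nn.Module):
    """Cross-entropy plus an auxiliary embedding prediction term
    (a hypothetical sketch, not the paper's exact formulation)."""

    def __init__(self, dec_dim: int, lm_embed: torch.Tensor, lambda_emb: float = 0.1):
        super().__init__()
        # Project decoder states into the (frozen) LM embedding space.
        self.proj = nn.Linear(dec_dim, lm_embed.size(1))
        # The pretrained LM embedding table serves only as a regression
        # target during training; it is discarded at decoding time, so
        # inference cost is unchanged.
        self.register_buffer("lm_embed", lm_embed)
        self.lambda_emb = lambda_emb  # assumed weighting hyperparameter

    def forward(self, dec_states, logits, targets):
        # dec_states: (B, T, dec_dim), logits: (B, T, V), targets: (B, T)
        ce = F.cross_entropy(logits.transpose(1, 2), targets)
        pred = self.proj(dec_states)          # (B, T, E)
        target_emb = self.lm_embed[targets]   # (B, T, E) frozen LM embeddings
        emb_loss = F.mse_loss(pred, target_emb)
        return ce + self.lambda_emb * emb_loss


# Usage with random tensors standing in for a real decoder's outputs:
B, T, V, D, E = 2, 5, 100, 256, 768
criterion = EmbeddingPredictionLoss(D, torch.randn(V, E))
loss = criterion(torch.randn(B, T, D), torch.randn(B, T, V),
                 torch.randint(0, V, (B, T)))
loss.backward()
```

Because the auxiliary branch is dropped at inference, decoding follows the standard decoder path, which is consistent with the abstract's claim of no extra computational cost in the decoding phase.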
