Paper Title

Minimising Biasing Word Errors for Contextual ASR with the Tree-Constrained Pointer Generator

Paper Authors

Guangzhi Sun, Chao Zhang, Philip C. Woodland

Paper Abstract

Contextual knowledge is essential for reducing speech recognition errors on high-valued long-tail words. This paper proposes a novel tree-constrained pointer generator (TCPGen) component that enables end-to-end ASR models to bias towards a list of long-tail words obtained using external contextual information. With only a small overhead in memory use and computation cost, TCPGen can structure thousands of biasing words efficiently into a symbolic prefix-tree and create a neural shortcut between the tree and the final ASR output to facilitate the recognition of the biasing words. To enhance TCPGen, we further propose a novel minimum biasing word error (MBWE) loss that directly optimises biasing word errors during training, along with a biasing-word-driven language model discounting (BLMD) method applied at test time. All contextual ASR systems were evaluated on the public Librispeech audiobook corpus and on data from the dialogue state tracking challenges (DSTC), with biasing lists extracted from the dialogue-system ontology. Consistent word error rate (WER) reductions were achieved with TCPGen, which were particularly significant on the biasing words, with around 40% relative reductions in the recognition error rates. MBWE and BLMD further improved the effectiveness of TCPGen and achieved more significant WER reductions on the biasing words. TCPGen also achieved zero-shot learning of words not in the audio training set, with large WER reductions on the out-of-vocabulary words in the biasing list.
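
The abstract describes two core mechanisms: structuring the biasing list into a symbolic prefix tree over subword units, and interpolating the ASR model's output distribution with a pointer distribution restricted to valid tree branches. Below is a minimal sketch of these two ideas, assuming a PyTorch setting. All names here (`PrefixNode`, `build_prefix_tree`, `tcpgen_step`, the externally supplied `gen_prob`) are illustrative assumptions, not the paper's implementation; in the actual TCPGen the generation probability is predicted by the network, and details such as word-end handling, MBWE training, and BLMD discounting are omitted.

```python
import torch

class PrefixNode:
    """One node of the biasing prefix tree; children are keyed by subword id."""
    def __init__(self):
        self.children = {}

def build_prefix_tree(biasing_words, tokenize):
    """Insert each biasing word's subword sequence into a shared prefix tree,
    so thousands of words share common prefixes and each decoding step only
    needs one dictionary lookup to find the valid continuations."""
    root = PrefixNode()
    for word in biasing_words:
        node = root
        for token_id in tokenize(word):
            node = node.children.setdefault(token_id, PrefixNode())
    return root

def tcpgen_step(model_probs, ptr_logits, node, gen_prob):
    """One decoding step of the pointer-generator interpolation (sketch).

    model_probs: (vocab,) output distribution of the end-to-end ASR model
    ptr_logits:  (vocab,) scores for the pointer distribution
    node:        current prefix-tree node (valid continuations = its children)
    gen_prob:    scalar in [0, 1]; predicted neurally in TCPGen, passed in here
    """
    vocab = model_probs.size(-1)
    valid = torch.zeros(vocab, dtype=torch.bool)
    if node.children:
        valid[list(node.children.keys())] = True
        # Pointer distribution: softmax restricted to the tree branches that
        # continue some biasing word from the current node.
        ptr_probs = torch.softmax(
            ptr_logits.masked_fill(~valid, float("-inf")), dim=-1)
    else:
        # No biasing word continues here: fall back to the model distribution.
        ptr_probs, gen_prob = torch.zeros(vocab), 0.0
    # The "neural shortcut": the final output mixes the two distributions.
    return (1.0 - gen_prob) * model_probs + gen_prob * ptr_probs
```

In use, the decoder would advance `node = node.children[token]` whenever the emitted subword stays on the tree and reset to the root otherwise, so the per-step cost stays small regardless of the biasing-list size.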
