Content-Context Factorized Representations for Automated Speech Recognition

Paper Title

Content-Context Factorized Representations for Automated Speech Recognition

Authors

Chan, David M., Ghosh, Shalini

Abstract

Deep neural networks have largely demonstrated their ability to perform automated speech recognition (ASR) by extracting meaningful features from input audio frames. Such features, however, may consist not only of information about the spoken language content, but also may contain information about unnecessary contexts such as background noise and sounds or speaker identity, accent, or protected attributes. Such information can directly harm generalization performance, by introducing spurious correlations between the spoken words and the context in which such words were spoken. In this work, we introduce an unsupervised, encoder-agnostic method for factoring speech-encoder representations into explicit content-encoding representations and spurious context-encoding representations. By doing so, we demonstrate improved performance on standard ASR benchmarks, as well as improved performance in both real-world and artificially noisy ASR scenarios.
