Paper Title

DEFORMER: Coupling Deformed Localized Patterns with Global Context for Robust End-to-end Speech Recognition

Paper Authors

Jiamin Xie and John H. L. Hansen

Paper Abstract

Convolutional neural networks (CNNs) have greatly improved speech recognition performance by exploiting localized time-frequency patterns. But these patterns are assumed to appear in symmetric and rigid kernels by the conventional CNN operation. This motivates the question: what about asymmetric kernels? In this study, we illustrate that adaptive views can discover local features which couple better with attention than fixed views of the input. We replace the depthwise CNNs in the Conformer architecture with a deformable counterpart, dubbed the "Deformer". By analyzing our best-performing model, we visualize both the local receptive fields and the global attention maps learned by the Deformer and show increased feature associations at the utterance level. Statistical analysis of the learned kernel offsets provides insight into how the information in features changes with network depth. Finally, replacing only half of the layers in the encoder, the Deformer improves the relative WER by +5.6% without an LM and by +6.4% with an LM over the Conformer baseline on the WSJ eval92 set.
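Since the abstract hinges on swapping the Conformer's rigid, symmetric depthwise convolution for a deformable one, a minimal sketch may help make the idea concrete. This is not the authors' implementation: the module name `DeformableDepthwiseConv1d` and the `offset_proj` predictor are illustrative, assuming one learned fractional time offset per kernel tap per time step, with linear interpolation for differentiability (the standard deformable-convolution recipe adapted to 1-D).

```python
# Minimal sketch (not the paper's code) of a deformable depthwise 1-D
# convolution, the kind of module the Deformer substitutes for the
# Conformer's depthwise convolution.
import torch
import torch.nn as nn


class DeformableDepthwiseConv1d(nn.Module):
    """Depthwise 1-D convolution whose taps sample at learned,
    input-dependent (possibly asymmetric) time offsets."""

    def __init__(self, channels: int, kernel_size: int):
        super().__init__()
        assert kernel_size % 2 == 1, "odd kernel keeps output length == input length"
        self.channels = channels
        self.kernel_size = kernel_size
        # Hypothetical offset predictor: one offset per tap, per time step,
        # so the receptive field can deform as the input changes.
        self.offset_proj = nn.Conv1d(channels, kernel_size, kernel_size,
                                     padding=kernel_size // 2)
        # Depthwise weights: one filter per channel.
        self.weight = nn.Parameter(torch.randn(channels, kernel_size) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        B, C, T = x.shape
        K = self.kernel_size
        offsets = self.offset_proj(x)                      # (B, K, T)
        # Base sampling grid of a rigid, symmetric kernel ...
        base = (torch.arange(T, device=x.device).view(1, 1, T)
                + torch.arange(K, device=x.device).view(1, K, 1) - K // 2)
        # ... deformed by the predicted fractional offsets.
        pos = (base + offsets).clamp(0, T - 1)             # (B, K, T)
        lo = pos.floor().long()
        hi = (lo + 1).clamp(max=T - 1)
        frac = pos - lo.float()
        # Gather neighbors and interpolate linearly, so the sampling
        # positions stay differentiable with respect to the offsets.
        x_e = x.unsqueeze(2).expand(B, C, K, T)            # (B, C, K, T)
        idx_lo = lo.unsqueeze(1).expand(B, C, K, T)
        idx_hi = hi.unsqueeze(1).expand(B, C, K, T)
        sampled = (x_e.gather(3, idx_lo) * (1.0 - frac).unsqueeze(1)
                   + x_e.gather(3, idx_hi) * frac.unsqueeze(1))
        # Weighted sum over taps = depthwise convolution with a deformed kernel.
        return (sampled * self.weight.view(1, C, K, 1)).sum(dim=2)
```

A quick shape check: `DeformableDepthwiseConv1d(channels=256, kernel_size=15)(torch.randn(4, 256, 100))` returns a `(4, 256, 100)` tensor, matching the `(batch, channels, time)` layout a Conformer convolution module consumes, so such a layer could be dropped in for the rigid depthwise convolution in selected encoder layers.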
