Paper Title

A Comparison of Transformer, Convolutional, and Recurrent Neural Networks on Phoneme Recognition

Authors

Kyuhong Shim, Wonyong Sung

Abstract

Phoneme recognition is a very important part of speech recognition that requires the ability to extract phonetic features from multiple frames. In this paper, we compare and analyze CNN, RNN, Transformer, and Conformer models using phoneme recognition. For CNN, the ContextNet model is used for the experiments. First, we compare the accuracy of various architectures under different constraints, such as the receptive field length, parameter size, and layer depth. Second, we interpret the performance difference of these models, especially when the observable sequence length varies. Our analyses show that Transformer and Conformer models benefit from the long-range accessibility of self-attention through input frames.
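
The abstract attributes the Transformer/Conformer advantage to self-attention's long-range access across input frames. As a minimal NumPy sketch (not the paper's implementation; single head, no learned projections, toy sizes), the following shows why every output frame can draw on every input frame, in contrast to a convolution's fixed local receptive field:

```python
import numpy as np

def self_attention(x):
    """Scaled dot-product self-attention over a sequence of frames.

    Each output frame is a weighted sum over ALL input frames, so the
    effective receptive field spans the entire sequence, regardless of
    its length.
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)  # (T, T) pairwise frame similarities
    # Softmax over frames (numerically stabilized)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights

T, d = 10, 4  # 10 frames with 4-dim features (illustrative, not from the paper)
rng = np.random.default_rng(0)
x = rng.standard_normal((T, d))
out, w = self_attention(x)

# Softmax outputs are strictly positive, so every output frame receives
# a nonzero contribution from every input frame:
print(np.all(w > 0))  # True
```

A depth-L CNN with kernel size k, by comparison, only sees roughly L·(k−1)+1 frames per output position, which is why the paper studies accuracy as a function of receptive field length.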
