Paper Title
FusionFormer: Fusing Operations in Transformer for Efficient Streaming Speech Recognition
Paper Authors
Paper Abstract
The recently proposed Conformer architecture, which combines convolution with attention to capture both local and global dependencies, has become the \textit{de facto} backbone model for Automatic Speech Recognition~(ASR). Inherited from Natural Language Processing (NLP) tasks, the architecture takes Layer Normalization~(LN) as the default normalization technique. However, through a series of systematic studies, we find that LN might take 10\% of the inference time even though it contributes only 0.1\% of the FLOPs. This motivates us to replace LN with other normalization techniques, e.g., Batch Normalization~(BN), to speed up inference by means of operator fusion and by avoiding the computation of mean and variance statistics during inference. After examining several naive attempts that directly remove all LN layers or replace them with BN at the same positions, we find that the divergence issue is mainly caused by unstable layer outputs. We therefore propose to append a BN layer to each linear or convolution layer, with which stabilized training is observed. We also propose to simplify the activations in Conformer, such as Swish and GLU, by replacing them with ReLU. All of these replacement modules can be fused into the weights of the adjacent linear/convolution layers and hence incur zero inference cost. We therefore name the resulting model FusionFormer. Our experiments indicate that FusionFormer is as effective as the LN-based Conformer while being about 10\% faster.
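The zero-inference-cost claim rests on the standard BN-folding identity: once training is done, a BatchNorm layer following a linear (or convolution) layer can be absorbed into that layer's weights and bias. The paper's own code is not reproduced here; below is a minimal PyTorch sketch of that folding step for the linear case, where the helper name `fuse_linear_bn` and the dimensions in the sanity check are our own illustrative choices, not anything from the paper.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_linear_bn(linear: nn.Linear, bn: nn.BatchNorm1d) -> nn.Linear:
    # For y = BN(W x + b) with running stats (mu, var), affine params
    # (gamma, beta) and epsilon eps, the same mapping is computed by a
    # single linear layer with
    #   W' = diag(gamma / sqrt(var + eps)) @ W
    #   b' = gamma * (b - mu) / sqrt(var + eps) + beta,
    # so no normalization statistics are computed at inference time.
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused = nn.Linear(linear.in_features, linear.out_features, bias=True)
    fused.weight.copy_(linear.weight * scale.unsqueeze(1))
    b = linear.bias if linear.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.copy_((b - bn.running_mean) * scale + bn.bias)
    return fused

if __name__ == "__main__":
    # Sanity check: the fused layer must match Linear -> BN in eval mode.
    linear, bn = nn.Linear(80, 256), nn.BatchNorm1d(256)
    bn.eval()  # use running statistics, as at inference
    x = torch.randn(4, 80)
    fused = fuse_linear_bn(linear, bn)
    assert torch.allclose(bn(linear(x)), fused(x), atol=1e-6)
```

The same algebra applies per output channel of a convolution, which is why the abstract insists on placing BN directly after each linear/convolution layer: LN, whose statistics depend on each input, admits no such fusion, and a ReLU (unlike Swish or GLU) leaves the preceding fused weights untouched.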