Paper Title
Layer-wise Fast Adaptation for End-to-End Multi-Accent Speech Recognition
Paper Authors
Paper Abstract
Accent variability poses a major challenge to automatic speech recognition (ASR) modeling. Adaptation systems based on one-hot accent vectors are commonly used, but they require prior knowledge of the target accent and cannot handle unseen accents. Furthermore, simply concatenating accent embeddings does not make good use of accent knowledge and yields only limited improvements. In this work, we aim to tackle these problems with a novel layer-wise adaptation structure injected into the encoder of an end-to-end (E2E) ASR model. The adapter layer encodes an arbitrary accent in the accent space and assists the ASR model in recognizing accented speech. Given an utterance, the adaptation structure extracts the corresponding accent information and transforms the input acoustic feature into an accent-related feature through a linear combination of all accent bases. We further explore the injection position of the adaptation layer, the number of accent bases, and different types of accent bases to achieve better accent adaptation. Experimental results show that, compared to the baseline, the proposed adaptation structure brings 12% and 10% relative word error rate (WER) reductions on the AESRC2020 accent dataset and the LibriSpeech dataset, respectively.
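The abstract describes the adapter only at a high level, so the following is a minimal sketch of the "linear combination of all accent bases" idea and not the authors' implementation. It assumes that each accent basis is an affine map on the acoustic features, that an utterance-level accent embedding is already available, and that the combination weights come from a softmax over a learned projection of that embedding. The names `accent_adapter`, `bases_w`, `bases_b`, and `proj`, as well as all dimensions, are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def accent_adapter(features, accent_embedding, bases_w, bases_b, proj):
    """Combine K accent bases into one accent-related feature sequence.

    features:         (T, D) acoustic features of one utterance
    accent_embedding: (E,)   utterance-level accent embedding (assumed given)
    bases_w, bases_b: (K, D, D) and (K, D); each basis is a hypothetical affine map
    proj:             (E, K) hypothetical projection to interpolation logits
    """
    # Interpolation weights over the K bases (softmax is an assumption here;
    # the abstract only says "linear combination").
    weights = softmax(accent_embedding @ proj)                       # (K,)
    # Apply every basis to every frame: result has shape (K, T, D).
    per_basis = np.einsum("td,kde->kte", features, bases_w)
    per_basis = per_basis + bases_b[:, None, :]
    # Weighted sum over the basis axis gives the adapted features (T, D).
    adapted = np.einsum("k,kte->te", weights, per_basis)
    return adapted, weights


# Toy usage with random values; all sizes are illustrative only.
rng = np.random.default_rng(0)
T, D, E, K = 50, 8, 4, 3
features = rng.standard_normal((T, D))
accent_embedding = rng.standard_normal(E)
bases_w = rng.standard_normal((K, D, D))
bases_b = rng.standard_normal((K, D))
proj = rng.standard_normal((E, K))

adapted, weights = accent_adapter(features, accent_embedding, bases_w, bases_b, proj)
```

Because the weights are computed from the utterance itself and are not a fixed one-hot accent label, a sketch of this kind can in principle place an unseen accent somewhere between the learned bases, which is the motivation the abstract gives for the approach.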