Speaker Adaptation for End-To-End Speech Recognition Systems in Noisy Environments

Paper Title

Speaker Adaptation for End-To-End Speech Recognition Systems in Noisy Environments

Authors

Dominik Wagner, Ilja Baumann, Sebastian P. Bayerl, Korbinian Riedhammer, Tobias Bocklet

Abstract

We analyze the impact of speaker adaptation in end-to-end automatic speech recognition models based on transformers and wav2vec 2.0 under different noise conditions. By including speaker embeddings obtained from x-vector and ECAPA-TDNN systems, as well as i-vectors, we achieve relative word error rate improvements of up to 16.3% on LibriSpeech and up to 14.5% on Switchboard. We show that the proven method of concatenating speaker vectors to the acoustic features and supplying them as auxiliary model inputs remains a viable option to increase the robustness of end-to-end architectures. The effect on transformer models is stronger when more noise is added to the input speech. The most substantial benefits for systems based on wav2vec 2.0 are achieved under moderate or no noise conditions. Both x-vectors and ECAPA-TDNN embeddings outperform i-vectors as speaker representations. The optimal embedding size depends on the dataset and also varies with the noise condition.
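The concatenation step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a single utterance-level speaker vector that is repeated for every acoustic frame; the function name, the list-based representation, and the toy dimensions are hypothetical and not taken from the paper. Plain Python lists keep the sketch dependency-free, whereas a real system would typically do this with a tensor library.

```python
from typing import List

Frame = List[float]


def append_speaker_embedding(features: List[Frame], speaker_vec: Frame) -> List[Frame]:
    """Concatenate one utterance-level speaker vector to every acoustic frame.

    features:    one list of acoustic feature values per time frame
    speaker_vec: a fixed-size speaker embedding (e.g. an i-vector or x-vector)
    """
    if not features:
        raise ValueError("features must contain at least one frame")
    # List addition builds a new, longer frame, so the inputs are not modified.
    return [frame + speaker_vec for frame in features]


# Hypothetical toy sizes: 3 frames of 4-dimensional acoustic features
# and a 2-dimensional speaker vector.
features = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]
speaker_vec = [-1.0, 2.0]

augmented = append_speaker_embedding(features, speaker_vec)
print(len(augmented), len(augmented[0]))  # 3 6
```

Each output frame has the acoustic dimension plus the embedding dimension, so the model's input layer would need to accept the wider feature vector.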
