Title
Towards End-to-End Training of Automatic Speech Recognition for Nigerian Pidgin
Authors
Abstract
The prevalence of automatic speech recognition (ASR) systems in spoken language applications has increased significantly in recent years. Notably, many African languages lack sufficient linguistic resources to support the robustness of these systems. This paper focuses on the development of an end-to-end speech recognition system customized for Nigerian Pidgin English. We investigated and evaluated different pretrained state-of-the-art architectures on a new dataset. Our empirical results demonstrate notable performance from the Wav2Vec2 XLSR-53 variant on our dataset, which achieved a word error rate (WER) of 29.6% on the test set, surpassing other architectures such as NeMo QuartzNet and wav2vec 2.0 BASE-100H in quantitative assessments. Additionally, we demonstrate that pretrained state-of-the-art architectures do not work well out-of-the-box. We performed a zero-shot evaluation using XLSR-English as the baseline, chosen for English's similarity to Nigerian Pidgin; this yielded a much higher WER of 73.7%. By adapting this architecture to the nuances represented in our dataset, we reduced the error by 59.84% relative. Our dataset comprises 4,288 recorded utterances from 10 native speakers, partitioned into training, validation, and test sets. This study underscores the potential for improving ASR systems for under-resourced languages like Nigerian Pidgin English, contributing to greater inclusion in speech technology applications. We publicly release our unique parallel (speech-to-text) dataset for Nigerian Pidgin, as well as the model weights, on Hugging Face. Our code will be made available to foster future research from the community.
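As a concrete illustration of the reported metric, the relative error reduction from the zero-shot baseline (73.7% WER) to the adapted model (29.6% WER) can be checked with a minimal word-error-rate sketch. The edit-distance implementation below is illustrative only, not the paper's evaluation code:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution in three reference words -> WER of 1/3
print(wer("dem don go market", "dem don go market"))  # 0.0

# Relative WER reduction reported in the abstract: 73.7% zero-shot vs. 29.6% adapted
zero_shot, adapted = 0.737, 0.296
relative_reduction = (zero_shot - adapted) / zero_shot
print(f"{relative_reduction:.2%}")  # ≈ 59.84%
```

Note that the 59.84% figure is a *relative* reduction; the absolute WER drop is 44.1 percentage points.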