An Automatic Speech Recognition System for Bengali Language based on Wav2Vec2 and Transfer Learning

Paper Title

An Automatic Speech Recognition System for Bengali Language based on Wav2Vec2 and Transfer Learning

Authors

Showrav, Tushar Talukder

Abstract

Automatic speech recognition (ASR) is an independent, automated method of decoding and transcribing spoken language. A typical ASR system extracts features from audio recordings or streams and runs one or more algorithms to map those features to the corresponding text. Much research has been done in the field of speech signal processing in recent years. Given adequate resources, both conventional ASR and emerging end-to-end (E2E) speech recognition have produced promising results. However, for low-resource languages like Bengali, the current state of ASR lags behind, even though the language's low-resource status does not reflect the fact that it is spoken by over 500 million people around the world. Despite the language's popularity, few diverse open-source datasets are available, which makes it difficult to conduct research on Bengali speech recognition systems. This paper is part of the competition named "BUET CSE Fest DL Sprint". Its purpose is to improve speech recognition performance for the Bengali language by applying an E2E speech recognition architecture within a transfer learning framework. The proposed method models the Bengali language effectively and achieves a score of 3.819 in "Levenshtein Mean Distance" on a test dataset of 7,747 samples, even though only 1,000 samples of the training dataset were used for training.
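
The abstract reports its result as a "Levenshtein Mean Distance" but does not define the metric. The sketch below shows one plausible reading: the edit distance between each reference transcript and its predicted transcript, averaged over all samples. It is only an illustration under stated assumptions, not the competition's scoring code. The function names `levenshtein` and `mean_levenshtein` are hypothetical, and the distance is computed over Python string elements (Unicode code points), whereas the competition may have scored at a different granularity.

```python
from typing import Sequence


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Edit distance: minimum number of insertions, deletions and
    substitutions needed to turn sequence `a` into sequence `b`."""
    # Keep the shorter sequence in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def mean_levenshtein(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Average edit distance over paired reference/hypothesis transcripts.

    Assumption: an unweighted mean over samples at code-point level.
    """
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    if not references:
        raise ValueError("at least one sample is required")
    total = sum(levenshtein(r, h) for r, h in zip(references, hypotheses))
    return total / len(references)
```

Because Python strings iterate over Unicode code points, a Bengali consonant and its attached vowel sign count as separate symbols here. Passing lists of words instead of strings to `levenshtein` would give a word-level distance with the same code.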
