Paper title
Annotated Speech Corpus for Low Resource Indian Languages: Awadhi, Bhojpuri, Braj and Magahi
Paper authors
Paper abstract
In this paper we discuss ongoing work on the development of a speech corpus for four low-resource Indo-Aryan languages (Awadhi, Bhojpuri, Braj and Magahi) using the field methods of linguistic data collection. The corpus currently totals approximately 18 hours (roughly 4-5 hours per language) and is transcribed and annotated with grammatical information such as part-of-speech tags, morphological features and Universal Dependencies relations. We discuss our methodology for data collection in these languages, most of which was carried out in the middle of the COVID-19 pandemic, with one of the aims being to generate some additional income for low-income groups speaking these languages. We also report the results of baseline experiments for automatic speech recognition systems in these languages.