Paper Title

Automatic Speech Recognition Datasets in Cantonese: A Survey and New Dataset

Paper Authors

Tiezheng Yu, Rita Frieske, Peng Xu, Samuel Cahyawijaya, Cheuk Tung Shadow Yiu, Holy Lovenia, Wenliang Dai, Elham J. Barezi, Qifeng Chen, Xiaojuan Ma, Bertram E. Shi, Pascale Fung

Paper Abstract

Automatic speech recognition (ASR) on low resource languages improves the access of linguistic minorities to technological advantages provided by artificial intelligence (AI). In this paper, we address the problem of data scarcity for the Hong Kong Cantonese language by creating a new Cantonese dataset. Our dataset, Multi-Domain Cantonese Corpus (MDCC), consists of 73.6 hours of clean read speech paired with transcripts, collected from Cantonese audiobooks from Hong Kong. It comprises philosophy, politics, education, culture, lifestyle and family domains, covering a wide range of topics. We also review all existing Cantonese datasets and analyze them according to their speech type, data source, total size and availability. We further conduct experiments with Fairseq S2T Transformer, a state-of-the-art ASR model, on the biggest existing dataset, Common Voice zh-HK, and our proposed MDCC, and the results show the effectiveness of our dataset. In addition, we create a powerful and robust Cantonese ASR model by applying multi-dataset learning on MDCC and Common Voice zh-HK.
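
The abstract mentions multi-dataset learning on MDCC and Common Voice zh-HK but does not spell out the procedure. Below is a minimal sketch of one plausible data-preparation step: pooling the two training manifests so a single Fairseq S2T Transformer can be trained jointly on both corpora. It assumes the TSV manifest layout used by fairseq's speech_to_text recipes (columns id, audio, n_frames, tgt_text, speaker); the file paths, dataset names, and merge helper are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch: merge MDCC and Common Voice zh-HK training manifests
# into one combined manifest for joint (multi-dataset) training.
# The TSV layout (id, audio, n_frames, tgt_text, speaker) follows fairseq's
# speech_to_text examples; all paths below are assumptions.
import csv
from pathlib import Path

MANIFEST_COLUMNS = ["id", "audio", "n_frames", "tgt_text", "speaker"]


def read_manifest(path: Path) -> list[dict]:
    """Read a fairseq-style S2T manifest (tab-separated, with a header row)."""
    with path.open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f, delimiter="\t"))


def merge_manifests(paths: dict[str, Path], out_path: Path) -> None:
    """Concatenate several manifests, prefixing each utterance id with its
    dataset name so ids remain unique after merging."""
    rows = []
    for name, path in paths.items():
        for row in read_manifest(path):
            row = {col: row.get(col, "") for col in MANIFEST_COLUMNS}
            row["id"] = f"{name}_{row['id']}"
            rows.append(row)
    with out_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=MANIFEST_COLUMNS, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    merge_manifests(
        {
            "mdcc": Path("mdcc/train.tsv"),                     # assumed path
            "cv_zh_hk": Path("common_voice_zh_hk/train.tsv"),   # assumed path
        },
        Path("combined/train.tsv"),
    )
```

The combined manifest could then be fed to a standard fairseq speech_to_text training run (fairseq-train with --task speech_to_text); the exact training configuration used in the paper is not given in the abstract.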
