Monolingual Data Selection Analysis for English-Mandarin Hybrid Code-switching Speech Recognition

Paper Title

Monolingual Data Selection Analysis for English-Mandarin Hybrid Code-switching Speech Recognition

Authors

Zhang, Haobo; Xu, Haihua; Pham, Van Tung; Huang, Hao; Chng, Eng Siong

Abstract

In this paper, we conduct a data selection analysis in building an English-Mandarin code-switching (CS) speech recognition (CSSR) system, aimed at a real CSSR contest in China. The overall training set has three subsets: a code-switching data set, an English data set (LibriSpeech), and a Mandarin data set. The code-switching data are Mandarin-dominated. First, we find that using the overall data yields worse results, so a data selection study is necessary. Then, to exploit monolingual data, we find that data matching is crucial: the Mandarin data are closely matched with the Mandarin part of the code-switching data, while the English data are not. However, the Mandarin data only help on utterances that are significantly Mandarin-dominated. Moreover, there is a balance point beyond which more monolingual data diverts the CSSR system and degrades results. Finally, we analyze the effectiveness of combining monolingual data to train a CSSR system with the HMM-DNN hybrid framework. The resulting CSSR system can perform within-utterance code-switch recognition, but it still lags behind the one trained on code-switching data.
