Personalized Adversarial Data Augmentation for Dysarthric and Elderly Speech Recognition

Paper Title

Personalized Adversarial Data Augmentation for Dysarthric and Elderly Speech Recognition

Authors

Zengrui Jin, Mengzhe Geng, Jiajun Deng, Tianzi Wang, Shujie Hu, Guinan Li, Xunying Liu

Abstract

Despite the rapid progress of automatic speech recognition (ASR) technologies targeting normal speech, accurate recognition of dysarthric and elderly speech remains a highly challenging task to date. Because these speakers often have mobility issues, it is difficult to collect large quantities of such data for ASR system development. Data augmentation techniques therefore play a vital role. In contrast to existing data augmentation techniques that only modify the speaking rate or the overall shape of the spectral contour, this paper models the fine-grained spectro-temporal differences between dysarthric, elderly and normal speech using a novel set of speaker-dependent (SD) generative adversarial network (GAN) based data augmentation approaches. These flexibly allow both: a) when parallel speech data is available, temporal or speed perturbed normal speech spectra to be modified so that they are closer to those of an impaired speaker; and b) for non-parallel data, the SVD-decomposed normal speech spectral basis features to be transformed into those of a target elderly speaker before being recomposed with the temporal bases. The resulting augmented data is used to train state-of-the-art TDNN and Conformer ASR systems. Experiments are conducted on four tasks: the English UASpeech and TORGO dysarthric speech corpora, and the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech datasets. The proposed GAN-based data augmentation approaches consistently outperform the baseline speed perturbation method, by up to 0.91% and 3.0% absolute (9.61% and 6.4% relative) word error rate (WER) reduction on the TORGO and DementiaBank data respectively. Consistent performance improvements are retained after applying LHUC-based speaker adaptation.
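
The non-parallel branch described in the abstract (decompose a spectrogram with SVD, transform the spectral basis, recompose with the temporal bases) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names `svd_split` and `recompose`, the choice to fold the singular values into the spectral basis, the random stand-in spectrogram, and the identity placeholder used in place of the speaker-dependent GAN are all assumptions made for illustration.

```python
import numpy as np

def svd_split(spectrogram, rank):
    """Split a (freq_bins x frames) spectrogram into spectral and temporal bases.

    Assumption: singular values are folded into the spectral basis.
    """
    U, s, Vt = np.linalg.svd(spectrogram, full_matrices=False)
    spectral_basis = U[:, :rank] * s[:rank]  # shape (freq_bins, rank)
    temporal_basis = Vt[:rank, :]            # shape (rank, frames)
    return spectral_basis, temporal_basis

def recompose(spectral_basis, temporal_basis):
    """Rebuild a spectrogram from (possibly transformed) spectral bases."""
    return spectral_basis @ temporal_basis

# Random stand-in for a normal-speech magnitude spectrogram (hypothetical data).
rng = np.random.default_rng(0)
spec = np.abs(rng.standard_normal((40, 100)))

spectral, temporal = svd_split(spec, rank=40)

# Placeholder for the speaker-dependent transform; the paper uses a GAN here.
# An identity copy keeps the sketch self-contained.
transformed = spectral.copy()

augmented = recompose(transformed, temporal)
```

With the identity placeholder and full rank, `augmented` reproduces `spec` up to floating-point error, which is a convenient sanity check before substituting a learned transform for the copy step.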
