Paper title
Automated detection of pronunciation errors in non-native English speech employing deep learning
Paper authors
Abstract
Despite significant advances in recent years, existing Computer-Assisted Pronunciation Training (CAPT) methods detect pronunciation errors with relatively low accuracy (60% precision at 40%-80% recall). This Ph.D. work proposes novel deep learning methods for detecting pronunciation errors in non-native (L2) English speech that outperform the state-of-the-art method on the AUC (area under the curve) metric by 41%, i.e., from 0.528 to 0.749. One problem with existing CAPT methods is the low availability of annotated mispronounced speech, which is needed to reliably train pronunciation error detection models. Therefore, the detection of pronunciation errors is reformulated as the task of generating synthetic mispronounced speech. Intuitively, if we could mimic mispronounced speech and produce any amount of training data, detecting pronunciation errors would be more effective. Furthermore, to eliminate the need to align canonical and recognized phonemes, a novel end-to-end multi-task technique that detects pronunciation errors directly is proposed. The pronunciation error detection models have been used at Amazon to automatically detect pronunciation errors in synthetic speech, accelerating research into new speech synthesis methods. The proposed deep learning methods are also shown to be applicable to the tasks of detecting and reconstructing dysarthric speech.
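For readers less familiar with the metrics quoted in the abstract, the following is a minimal sketch of how precision, recall, and an area-under-the-curve score can be computed for a binary mispronunciation detector. It is not the evaluation code of the thesis: the function names, the toy labels, and the toy scores are all hypothetical, and because the abstract does not state which curve the AUC refers to, the sketch assumes the ROC curve.

```python
# Illustrative only: toy data and helper names are assumptions, not taken from the thesis.

def precision_recall(labels, scores, threshold):
    """Precision and recall when a score >= threshold is flagged as a pronunciation error."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def roc_auc(labels, scores):
    """ROC AUC as the probability that a mispronounced item outscores a correct one.

    Ties count as half a win. Assumes both classes are present.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical detector output: label 1 = mispronounced, 0 = correctly pronounced.
labels = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.6, 0.2, 0.1]

for threshold in (0.85, 0.5):
    precision, recall = precision_recall(labels, scores, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")

print(f"ROC AUC = {roc_auc(labels, scores):.3f}")
```

Raising the threshold trades recall for precision, which is why the abstract reports precision at a range of recall values; an AUC score summarizes the whole trade-off curve in a single number.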