Harmonicity Plays a Critical Role in DNN Based Versus in Biologically-Inspired Monaural Speech Segregation Systems

Paper title

Harmonicity Plays a Critical Role in DNN Based Versus in Biologically-Inspired Monaural Speech Segregation Systems

Authors

Rahil Parikh, Ilya Kavalerov, Carol Espy-Wilson, Shihab Shamma

Abstract

Recent advancements in deep learning have led to drastic improvements in speech segregation models. Despite their success and growing applicability, few efforts have been made to analyze the underlying principles that these networks learn to perform segregation. Here we analyze the role of harmonicity in two state-of-the-art deep neural network (DNN)-based models: Conv-TasNet and DPT-Net. We evaluate their performance with mixtures of natural speech versus slightly manipulated inharmonic speech, where harmonics are slightly jittered in frequency. We find that performance deteriorates significantly if one source is even slightly harmonically jittered, e.g., an imperceptible 3% harmonic jitter degrades the performance of Conv-TasNet from 15.4 dB to 0.70 dB. Training the model on inharmonic speech does not remedy this sensitivity, instead resulting in worse performance on natural speech mixtures, making inharmonicity a powerful adversarial factor in DNN models. Furthermore, additional analyses reveal that DNN algorithms deviate markedly from biologically inspired algorithms that rely primarily on timing cues and not harmonicity to segregate speech.
