Paper title
Analysing Diffusion-based Generative Approaches versus Discriminative Approaches for Speech Restoration
Paper authors
Abstract
Diffusion-based generative models have had a high impact on the computer vision and speech processing communities in recent years. Besides data generation tasks, they have also been employed for data restoration tasks such as speech enhancement and dereverberation. While discriminative models have traditionally been argued to be more powerful, e.g., for speech enhancement, generative diffusion approaches have recently been shown to narrow this performance gap considerably. In this paper, we systematically compare the performance of generative diffusion models and discriminative approaches on different speech restoration tasks. For this, we extend our prior contributions on diffusion-based speech enhancement in the complex time-frequency domain to the task of bandwidth extension. We then compare it to a discriminatively trained neural network with the same network architecture on three restoration tasks, namely speech denoising, dereverberation, and bandwidth extension. We observe that the generative approach performs better overall than its discriminative counterpart on all tasks, with the strongest benefit for non-additive distortion models, as in dereverberation and bandwidth extension. Code and audio examples can be found online at https://uhh.de/inf-sp-sgmsemultitask