Paper Title
Improved Normalizing Flow-Based Speech Enhancement using an All-pole Gammatone Filterbank for Conditional Input Representation
Paper Authors
Paper Abstract
Deep generative models for Speech Enhancement (SE) have received increasing attention in recent years. The most prominent examples are Generative Adversarial Networks (GANs), while normalizing flows (NFs) have received less attention despite their potential. Building on previous work, architectural modifications are proposed, along with an investigation of different conditional input representations. Despite being a common choice in related work, Mel-spectrograms prove to be inadequate for the given scenario. As an alternative, a novel All-Pole Gammatone filterbank (APG) with high temporal resolution is proposed. Although computational evaluation metrics suggest that state-of-the-art GAN-based methods perform best, a perceptual evaluation via a listening test indicates that the presented NF approach (based on time domain and APG) performs best, especially at lower SNRs. On average, APG outputs are rated as having good quality, which is unmatched by the other methods, including GAN.