Paper title
Speech Enhancement with Perceptually-motivated Optimization and Dual Transformations
Paper authors
Paper abstract
To address the monaural speech enhancement problem, numerous studies have enhanced speech either in the time domain, operating on an inner domain learned from the speech mixture, or in the time-frequency domain, operating on fixed full-band short-time Fourier transform (STFT) spectrograms. Very recently, a few studies on sub-band based speech enhancement have been proposed. By operating on sub-band spectrograms, those studies demonstrated competitive performance on the DNS2020 benchmark dataset. Although attractive, this new research direction has not been fully explored, and there is still room for improvement. In this study, we therefore delve into this latest research direction and propose a sub-band based speech enhancement system with perceptually-motivated optimization and dual transformations, called PT-FSE. Specifically, our proposed PT-FSE model improves its backbone, a full-band and sub-band fusion model, in three ways. First, we design a frequency transformation module that aims to strengthen the global frequency correlation. Second, a temporal transformation is introduced to capture long-range temporal contexts. Finally, a novel loss that leverages properties of human auditory perception is proposed to encourage the model to focus on low-frequency enhancement. To validate the effectiveness of our proposed model, extensive experiments are conducted on the DNS2020 dataset. Experimental results show that our PT-FSE system not only achieves substantial improvements over its backbone but also outperforms the current state of the art (SOTA) while being 27% smaller than the SOTA. With an average NB-PESQ of 3.57 on the benchmark dataset, our system offers the best speech enhancement results reported to date.
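The abstract does not give the form of the perceptually-motivated loss, so the following is only an illustrative sketch of how a spectral loss could be biased toward low frequencies, not the PT-FSE loss itself. It assumes a magnitude spectrogram stored as nested lists, a 16 kHz sample rate, and per-bin weights taken from the slope of the common mel-scale formula; the function names `mel_slope_weights` and `weighted_spectral_loss` are hypothetical.

```python
def mel_slope_weights(num_bins, sample_rate=16000):
    """Per-bin weights proportional to the slope of the mel scale.

    With mel(f) = 2595 * log10(1 + f / 700), the slope d(mel)/df is
    proportional to 1 / (700 + f). Normalizing so that the weight at
    0 Hz equals 1 gives w(f) = 700 / (700 + f), which decays with frequency.
    """
    nyquist = sample_rate / 2.0
    weights = []
    for k in range(num_bins):
        # Bin k is assumed to sit at a linearly spaced frequency in [0, Nyquist].
        freq_hz = nyquist * k / (num_bins - 1) if num_bins > 1 else 0.0
        weights.append(700.0 / (700.0 + freq_hz))
    return weights


def weighted_spectral_loss(estimate, target, weights):
    """Weighted mean squared error over a [frames][bins] spectrogram."""
    total = 0.0
    count = 0
    for est_frame, tgt_frame in zip(estimate, target):
        for est, tgt, w in zip(est_frame, tgt_frame, weights):
            total += w * (est - tgt) ** 2
            count += 1
    return total / count if count else 0.0


# Toy comparison: the same unit error placed in the lowest vs. the highest bin.
weights = mel_slope_weights(num_bins=5, sample_rate=16000)
target = [[1.0, 1.0, 1.0, 1.0, 1.0]]
low_bin_error = [[2.0, 1.0, 1.0, 1.0, 1.0]]
high_bin_error = [[1.0, 1.0, 1.0, 1.0, 2.0]]
loss_low = weighted_spectral_loss(low_bin_error, target, weights)
loss_high = weighted_spectral_loss(high_bin_error, target, weights)
```

In this toy setup, an error in the lowest bin is penalized more heavily than an equal error in the highest bin, which is the kind of pressure that would push a model toward low-frequency enhancement. Any real system would likely use a differentiable tensor implementation and a weighting chosen by its authors.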