Paper Title
Does a PESQNet (Loss) Require a Clean Reference Input? The Original PESQ Does, But ACR Listening Tests Don't
Paper Authors
Paper Abstract
Perceptual evaluation of speech quality (PESQ) requires a clean speech reference as input, but predicts the results of (reference-free) absolute category rating (ACR) tests. In this work, we train a fully convolutional recurrent neural network (FCRN) as a deep noise suppression (DNS) model, with either a non-intrusive or an intrusive PESQNet, where only the latter has access to a clean speech reference. The PESQNet is used as a mediator providing a perceptual loss during DNS training to maximize the PESQ score of the enhanced speech signal. For the intrusive PESQNet, we investigate two topologies, called early-fusion (EF) and middle-fusion (MF) PESQNet, and compare them to the non-intrusive PESQNet to evaluate and quantify the benefits of employing a clean speech reference input during DNS training. Detailed analyses show that the DNS trained with the MF-intrusive PESQNet outperforms the Interspeech 2021 DNS Challenge baseline and the same DNS trained with an MSE loss by 0.23 and 0.12 PESQ points, respectively. Furthermore, we can show that the intrusive PESQNet yields only marginal benefits over the DNS trained with the non-intrusive PESQNet. Therefore, like ACR listening tests, the PESQNet does not necessarily require a clean speech reference input, opening the possibility of using real data for DNS training.
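The mediator idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' implementation: the two `toy_*` functions are hypothetical hand-written stand-ins for trained PESQNets, the squared-gap form of the loss is assumed, and the score bounds 1.04 and 4.64 are the commonly cited limits of the wideband PESQ MOS-LQO scale. The sketch only shows the difference in interface: the intrusive predictor needs the clean reference, while the non-intrusive one scores the enhanced signal alone.

```python
from typing import Sequence

# Assumed bounds of the wideband PESQ (MOS-LQO) scale; not taken from the abstract.
PESQ_MIN = 1.04
PESQ_MAX = 4.64

Signal = Sequence[float]


def mediator_loss(predicted_pesq: float, target: float = PESQ_MAX) -> float:
    """Assumed perceptual loss: squared gap between predicted and target PESQ."""
    return (target - predicted_pesq) ** 2


def toy_intrusive_pesqnet(enhanced: Signal, clean: Signal) -> float:
    """Hypothetical stand-in for an intrusive PESQNet (uses the clean reference).

    Maps the mean squared error against the reference onto the PESQ range.
    """
    mse = sum((e - c) ** 2 for e, c in zip(enhanced, clean)) / len(clean)
    return PESQ_MIN + (PESQ_MAX - PESQ_MIN) / (1.0 + mse)


def toy_non_intrusive_pesqnet(enhanced: Signal) -> float:
    """Hypothetical stand-in for a non-intrusive PESQNet (no reference input).

    Uses sample-to-sample roughness as a crude proxy for residual noise.
    """
    steps = [(b - a) ** 2 for a, b in zip(enhanced, enhanced[1:])]
    roughness = sum(steps) / len(steps)
    return PESQ_MIN + (PESQ_MAX - PESQ_MIN) / (1.0 + roughness)


# Toy signals: a smooth "clean" frame and a distorted "enhanced" frame.
clean = [0.0, 0.5, 1.0, 0.5, 0.0]
distorted = [0.3, 0.1, 1.4, 0.2, 0.4]

loss_intrusive = mediator_loss(toy_intrusive_pesqnet(distorted, clean))
loss_non_intrusive = mediator_loss(toy_non_intrusive_pesqnet(distorted))
```

In an actual training setup the stand-ins would be replaced by a differentiable network, so that the loss can be backpropagated into the DNS model. Because the non-intrusive variant never touches `clean`, it could in principle be evaluated on real recordings for which no clean reference exists.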