CCATMos: Convolutional Context-aware Transformer Network for Non-intrusive Speech Quality Assessment

Paper Title

CCATMos: Convolutional Context-aware Transformer Network for Non-intrusive Speech Quality Assessment

Authors

Yuchen Liu, Li-Chia Yang, Alex Pawlicki, Marko Stamenovic

Abstract

Speech quality assessment has been a critical component in many voice-communication applications such as telephony and online conferencing. Traditional intrusive speech quality assessment requires the clean reference of the degraded utterance to provide an accurate quality measurement. This requirement limits the usability of these methods in real-world scenarios. On the other hand, non-intrusive subjective measurement is the "gold standard" in evaluating speech quality, as human listeners can intrinsically evaluate the quality of any degraded speech with ease. In this paper, we propose a novel end-to-end model structure called the Convolutional Context-Aware Transformer (CCAT) network to predict the mean opinion score (MOS) of human raters. We evaluate our model on three MOS-annotated datasets spanning multiple languages and distortion types and submit our results to the ConferencingSpeech 2022 Challenge. Our experiments show that CCAT provides promising MOS predictions compared to current state-of-the-art non-intrusive speech assessment models: on the challenge evaluation test set, the average Pearson correlation coefficient (PCC) increases from 0.530 to 0.697 and the average RMSE decreases from 0.768 to 0.570 relative to the baseline model.
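
The two metrics quoted in the abstract, PCC and RMSE, compare predicted MOS values against the MOS given by human raters. The following is a minimal sketch of how they are commonly computed, not the authors' evaluation code. The function names `pearson_cc` and `rmse` and the sample scores are hypothetical and serve only as illustration.

```python
import math


def pearson_cc(pred, target):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    # Covariance and variances (the 1/n factors cancel in the ratio).
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov / math.sqrt(var_p * var_t)


def rmse(pred, target):
    """Root-mean-square error between two equal-length score lists."""
    n = len(pred)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / n)


# Hypothetical MOS values on the usual 1-5 scale (illustration only).
human_mos = [1.8, 2.5, 3.1, 4.0, 4.6]
predicted_mos = [2.1, 2.4, 3.4, 3.7, 4.4]

print("PCC:", pearson_cc(predicted_mos, human_mos))
print("RMSE:", rmse(predicted_mos, human_mos))
```

A higher PCC means the predictions track the ranking and spread of the human scores more closely, while a lower RMSE means the predicted values are closer to the human scores in absolute terms.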
