Title
Pre-trained Speech Representations as Feature Extractors for Speech Quality Assessment in Online Conferencing Applications
Authors
Abstract
Speech quality in online conferencing applications is typically assessed through human judgements in the form of the mean opinion score (MOS) metric. Since such a labor-intensive approach is not feasible for large-scale speech quality assessments in most settings, the focus has shifted towards automated MOS prediction through end-to-end training of deep neural networks (DNNs). Instead of training a network from scratch, we propose to leverage the speech representations from the pre-trained wav2vec-based XLS-R model. However, the number of parameters of such a model exceeds that of task-specific DNNs by several orders of magnitude, which poses a challenge for the resulting fine-tuning procedures on smaller datasets. Therefore, we opt to use pre-trained speech representations from XLS-R in a feature extraction rather than a fine-tuning setting, thereby significantly reducing the number of trainable model parameters. We compare our proposed XLS-R-based feature extractor to a Mel-frequency cepstral coefficient (MFCC)-based one, and experiment with various combinations of bidirectional long short-term memory (Bi-LSTM) and attention pooling feedforward (AttPoolFF) networks trained on the output of the feature extractors. We demonstrate the increased performance of pre-trained XLS-R embeddings in terms of a reduced root mean squared error (RMSE) on the ConferencingSpeech 2022 MOS prediction task.
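The setup described above can be sketched as a small trainable head on top of frozen speech embeddings: a Bi-LSTM over the frame-level features, attention pooling across frames, and a feedforward regressor producing a scalar MOS. The sketch below (PyTorch) is illustrative only: the hidden sizes, the single-layer attention, and the random tensor standing in for XLS-R frame embeddings (1024-dimensional, as in the 300M-parameter XLS-R checkpoint) are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AttPoolMOSHead(nn.Module):
    """Trainable head on frozen speech embeddings:
    Bi-LSTM -> attention pooling -> feedforward MOS regressor.
    Layer sizes are illustrative, not taken from the paper."""
    def __init__(self, feat_dim=1024, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.att = nn.Linear(2 * hidden, 1)   # scalar attention score per frame
        self.ff = nn.Sequential(nn.Linear(2 * hidden, 64), nn.ReLU(),
                                nn.Linear(64, 1))

    def forward(self, x):                      # x: (batch, frames, feat_dim)
        h, _ = self.lstm(x)                    # (batch, frames, 2*hidden)
        w = torch.softmax(self.att(h), dim=1)  # attention weights over frames
        pooled = (w * h).sum(dim=1)            # weighted sum -> (batch, 2*hidden)
        return self.ff(pooled).squeeze(-1)     # one predicted MOS per utterance

# Stand-in for frozen XLS-R frame embeddings of 4 utterances, 50 frames each:
feats = torch.randn(4, 50, 1024)
model = AttPoolMOSHead()
mos = model(feats)
print(mos.shape)  # torch.Size([4])
```

Because only this head is trained while the XLS-R encoder stays frozen, the number of trainable parameters is a tiny fraction of the full model, which is what makes training feasible on smaller MOS-labeled datasets.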