Paper Title

STOI-Net: A Deep Learning based Non-Intrusive Speech Intelligibility Assessment Model

Paper Authors

Ryandhimas E. Zezario, Szu-Wei Fu, Chiou-Shann Fuh, Yu Tsao, Hsin-Min Wang

Abstract

The calculation of most objective speech intelligibility assessment metrics requires clean speech as a reference. Such a requirement may limit the applicability of these metrics in real-world scenarios. To overcome this limitation, we propose a deep learning-based non-intrusive speech intelligibility assessment model, namely STOI-Net. The input and output of STOI-Net are speech spectral features and predicted STOI scores, respectively. The model is formed by the combination of a convolutional neural network and bidirectional long short-term memory (CNN-BLSTM) architecture with a multiplicative attention mechanism. Experimental results show that the STOI score estimated by STOI-Net has a good correlation with the actual STOI score when tested with noisy and enhanced speech utterances. The correlation values are 0.97 and 0.83, respectively, for the seen test condition (the test speakers and noise types are involved in the training set) and the unseen test condition (the test speakers and noise types are not involved in the training set). The results confirm the capability of STOI-Net to accurately predict the STOI scores without referring to clean speech.
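
The abstract specifies the architecture only at a high level: a CNN-BLSTM combined with a multiplicative attention mechanism that maps speech spectral features to a predicted STOI score. Below is a minimal PyTorch sketch of one possible realization, not the authors' implementation; the class name `STOINetSketch`, the layer sizes, and the frame-score-averaging head are illustrative assumptions.

```python
# Minimal sketch of a CNN-BLSTM + multiplicative (dot-product) attention model
# for non-intrusive STOI prediction. Hyperparameters are assumptions, not the
# paper's configuration.
import torch
import torch.nn as nn

class STOINetSketch(nn.Module):
    def __init__(self, n_freq_bins=257, cnn_channels=16, lstm_hidden=128):
        super().__init__()
        # CNN front-end over the (time x frequency) spectrogram, treated as a 1-channel image.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, cnn_channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(cnn_channels, cnn_channels, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # BLSTM over time, consuming the flattened CNN features of each frame.
        self.blstm = nn.LSTM(
            input_size=cnn_channels * n_freq_bins,
            hidden_size=lstm_hidden,
            batch_first=True,
            bidirectional=True,
        )
        # Multiplicative (scaled dot-product) self-attention over frames.
        self.query = nn.Linear(2 * lstm_hidden, 2 * lstm_hidden)
        # Frame-level score head; frame scores are averaged into an utterance-level estimate.
        self.frame_score = nn.Sequential(
            nn.Linear(2 * lstm_hidden, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
            nn.Sigmoid(),  # STOI lies in [0, 1]
        )

    def forward(self, spec):                      # spec: (batch, time, n_freq_bins)
        x = self.cnn(spec.unsqueeze(1))           # (batch, C, time, freq)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        h, _ = self.blstm(x)                      # (batch, time, 2 * lstm_hidden)
        q = self.query(h)                         # queries for multiplicative attention
        attn = torch.softmax(q @ h.transpose(1, 2) / h.size(-1) ** 0.5, dim=-1)
        h = attn @ h                              # attended frame representations
        frame_scores = self.frame_score(h).squeeze(-1)   # (batch, time)
        return frame_scores.mean(dim=1), frame_scores    # utterance score, frame scores

if __name__ == "__main__":
    model = STOINetSketch()
    dummy_spec = torch.randn(2, 100, 257)         # 2 utterances, 100 frames, 257 bins
    utt_scores, _ = model(dummy_spec)
    print(utt_scores.shape)                       # torch.Size([2])
```

Predicting frame-level scores and averaging them into an utterance-level score is a common design in non-intrusive speech assessment models; whether STOI-Net uses exactly this output head is an assumption in this sketch.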
