Paper Title

Emotional Speaker Identification using a Novel Capsule Nets Model

Authors

Ali Bou Nassif, Ismail Shahin, Ashraf Elnagar, Divya Velayudhan, Adi Alhudhaif, Kemal Polat

Abstract

Speaker recognition systems are widely used in various applications to identify a person by their voice; however, the high degree of variability in speech signals makes this a challenging task. Dealing with emotional variations is very difficult because emotions alter the voice characteristics of a person; thus, the acoustic features differ from those used to train models in a neutral environment. Therefore, speaker recognition models trained on neutral speech fail to correctly identify speakers under emotional stress. Although considerable advancements in speaker identification have been made using convolutional neural networks (CNNs), CNNs cannot exploit the spatial association between low-level features. Inspired by the recent introduction of capsule networks (CapsNets), deep learning models designed to overcome the inability of CNNs, with their pooling technique, to preserve the pose relationship between low-level features, this study investigates the performance of CapsNets in identifying speakers from emotional speech recordings. A CapsNet-based speaker identification model is proposed and evaluated using three distinct speech databases: the Emirati Speech Database, the SUSAS dataset, and RAVDESS (open-access). The proposed model is also compared to baseline systems. Experimental results demonstrate that the proposed CapsNet model trains faster and provides better results than current state-of-the-art schemes. The effect of the routing algorithm on speaker identification performance was also studied by varying the number of iterations, both with and without a decoder network.
