Paper Title
Improving Speech Emotion Recognition with Unsupervised Speaking Style Transfer
Paper Authors
Paper Abstract
Humans can effortlessly modify various prosodic attributes, such as the placement of stress and the intensity of sentiment, to convey a specific emotion while maintaining consistent linguistic content. Motivated by this capability, we propose EmoAug, a novel style transfer model designed to enhance emotional expression and tackle the data scarcity issue in speech emotion recognition (SER) tasks. EmoAug consists of a semantic encoder and a paralinguistic encoder that represent verbal and non-verbal information, respectively. Additionally, a decoder reconstructs speech signals by conditioning on these two information streams in an unsupervised fashion. Once training is completed, EmoAug enriches the expression of emotional speech with different prosodic attributes, such as stress, rhythm, and intensity, by feeding different styles into the paralinguistic encoder. EmoAug also enables us to generate a similar number of samples for each class, addressing the data imbalance issue. Experimental results on the IEMOCAP dataset demonstrate that EmoAug can successfully transfer different speaking styles while retaining the speaker identity and semantic content. Furthermore, we train an SER model with data augmented by EmoAug and show that the augmented model not only surpasses state-of-the-art supervised and self-supervised methods but also overcomes the overfitting problem caused by data imbalance. Some audio samples can be found on our demo website.
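
The abstract does not specify implementation details, so the following is a minimal, illustrative PyTorch sketch of the two-encoder/one-decoder layout it describes. All module choices, dimensions, and the GRU-based design are assumptions made here for illustration, not the paper's actual architecture.

import torch
import torch.nn as nn

class EmoAugSketch(nn.Module):
    """Minimal sketch of the encoder-decoder layout described in the abstract.

    Layer types and sizes are illustrative assumptions; the abstract does not
    specify them.
    """

    def __init__(self, n_mels: int = 80, d_model: int = 256):
        super().__init__()
        # Semantic encoder: frame-level representation of verbal (linguistic) content.
        self.semantic_encoder = nn.GRU(n_mels, d_model, batch_first=True, bidirectional=True)
        # Paralinguistic encoder: summarizes non-verbal style (stress, rhythm,
        # intensity) into a fixed-size utterance-level embedding.
        self.para_encoder = nn.GRU(n_mels, d_model, batch_first=True)
        # Decoder: reconstructs the mel spectrogram conditioned on both streams.
        self.decoder = nn.GRU(2 * d_model + d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, n_mels)

    def forward(self, content_mel: torch.Tensor, style_mel: torch.Tensor) -> torch.Tensor:
        # content_mel, style_mel: (batch, time, n_mels)
        content, _ = self.semantic_encoder(content_mel)    # (B, T, 2*d_model)
        _, style = self.para_encoder(style_mel)            # final hidden: (1, B, d_model)
        style = style[-1].unsqueeze(1).expand(-1, content.size(1), -1)
        hidden, _ = self.decoder(torch.cat([content, style], dim=-1))
        return self.out(hidden)                            # reconstructed mel

# Unsupervised training: reconstruct each utterance from itself
# (content and style come from the same signal, so no emotion labels are needed).
model = EmoAugSketch()
mel = torch.randn(4, 120, 80)                              # dummy mel batch
loss = nn.functional.l1_loss(model(mel, mel), mel)
loss.backward()

# Augmentation: pair an utterance's content with a differently-styled reference
# to vary prosody while keeping the linguistic content.
style_ref = torch.randn(4, 95, 80)
with torch.no_grad():
    augmented = model(mel, style_ref)

Feeding the same utterance to both encoders gives the unsupervised reconstruction objective; at augmentation time, pairing one utterance's content with another reference's style yields new prosodic variants, and generating a similar number of such variants per emotion class addresses the data imbalance noted above.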