Paper Title
Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models
Paper Authors
Paper Abstract
Automatic emotion recognition plays a key role in computer-human interaction as it has the potential to enrich the next-generation artificial intelligence with emotional intelligence. It finds applications in customer and/or representative behavior analysis in call centers, gaming, personal assistants, and social robots, to mention a few. Therefore, there has been an increasing demand to develop robust automatic methods to analyze and recognize the various emotions. In this paper, we propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities. More specifically, we i) adapt a residual network (ResNet) based model trained on a large-scale speaker recognition task using transfer learning along with a spectrogram augmentation approach to recognize emotions from speech, and ii) use a fine-tuned bidirectional encoder representations from transformers (BERT) based model to represent and recognize emotions from the text. The proposed system then combines the ResNet and BERT-based model scores using a late fusion strategy to further improve the emotion recognition performance. The proposed multimodal solution addresses the data scarcity limitation in emotion recognition using transfer learning, data augmentation, and fine-tuning, thereby improving the generalization performance of the emotion recognition models. We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture (IEMOCAP) dataset. Experimental results indicate that both audio and text-based models improve the emotion recognition performance and that the proposed multimodal solution achieves state-of-the-art results on the IEMOCAP benchmark.
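As a rough illustration of the score-level (late) fusion step described in the abstract, the following Python sketch combines per-utterance softmax posteriors from the audio (ResNet) branch and the text (BERT) branch with a weighted average. The function name late_fusion, the weight alpha, and the four-class IEMOCAP label set shown here are illustrative assumptions, not details taken from the paper.

# Minimal late-fusion sketch. Assumptions: the unimodal class posteriors are already
# computed by the audio (ResNet) and text (BERT) models; the fusion weight `alpha`
# and the four emotion classes are hypothetical choices for illustration only.
import numpy as np

EMOTIONS = ["angry", "happy", "neutral", "sad"]  # a common 4-class IEMOCAP setup

def late_fusion(audio_scores, text_scores, alpha=0.5):
    """Combine unimodal posteriors with a weighted average (score-level fusion)."""
    audio_scores = np.asarray(audio_scores, dtype=float)
    text_scores = np.asarray(text_scores, dtype=float)
    return alpha * audio_scores + (1.0 - alpha) * text_scores

# Example: hypothetical posteriors for one utterance from each branch.
audio_posteriors = np.array([0.10, 0.55, 0.25, 0.10])  # speech (ResNet) branch
text_posteriors  = np.array([0.05, 0.40, 0.45, 0.10])  # text (BERT) branch

fused = late_fusion(audio_posteriors, text_posteriors, alpha=0.6)
print("fused scores:", dict(zip(EMOTIONS, fused.round(3))))
print("predicted emotion:", EMOTIONS[int(np.argmax(fused))])

A practical appeal of score-level fusion of this kind is that the two unimodal models can be trained and fine-tuned independently; only their output scores need to be aligned per utterance at inference time.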