Paper Title
Contrastive Regularization for Multimodal Emotion Recognition Using Audio and Text
Paper Authors
Paper Abstract
Speech emotion recognition is a challenging task and an important step towards more natural human-computer interaction (HCI). A popular approach is multimodal emotion recognition based on model-level fusion, in which each modality's signal is encoded into an embedding and the embeddings are then concatenated for the final classification. However, due to the influence of noise or other factors, the individual modalities do not always point to the same emotional category, which hurts the generalization of the model. In this paper, we propose a novel regularization method via contrastive learning for multimodal emotion recognition using audio and text. By introducing a discriminator to distinguish between same-emotion and different-emotion pairs, we explicitly constrain the latent code of each modality to carry the same emotional information, thereby reducing noise interference and yielding more discriminative representations. Experiments are performed on the standard IEMOCAP dataset for 4-class emotion recognition. The results show improvements of 1.44\% and 1.53\% in weighted accuracy (WA) and unweighted accuracy (UA), respectively, over the baseline system.
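The abstract describes two pieces: a model-level fusion baseline (encode each modality, concatenate the embeddings, classify) and a contrastive regularizer realized as a discriminator over same-emotion versus different-emotion latent pairs. The sketch below is a minimal PyTorch illustration of one plausible reading of that setup; the encoder sizes, the within-batch negative-pairing scheme, the `reg_weight` coefficient, and all module names are assumptions made for illustration, not the authors' actual implementation.

```python
# Minimal sketch: model-level fusion plus a pair discriminator used as a
# contrastive regularizer. All architecture details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityEncoder(nn.Module):
    """Maps a pre-extracted feature vector (audio or text) to a latent code."""

    def __init__(self, in_dim: int, latent_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class PairDiscriminator(nn.Module):
    """Scores whether an (audio, text) latent pair carries the same emotion."""

    def __init__(self, latent_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * latent_dim, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, z_audio: torch.Tensor, z_text: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([z_audio, z_text], dim=-1)).squeeze(-1)


class FusionClassifier(nn.Module):
    """Model-level fusion: concatenate the latent codes, then classify."""

    def __init__(self, latent_dim: int = 128, num_classes: int = 4):
        super().__init__()
        self.head = nn.Linear(2 * latent_dim, num_classes)

    def forward(self, z_audio: torch.Tensor, z_text: torch.Tensor) -> torch.Tensor:
        return self.head(torch.cat([z_audio, z_text], dim=-1))


def training_step(audio_feat, text_feat, labels,
                  enc_a, enc_t, clf, disc, reg_weight: float = 0.1):
    """One step: fused classification loss + contrastive pair-discrimination loss."""
    z_a, z_t = enc_a(audio_feat), enc_t(text_feat)

    # Standard cross-entropy on the 4 emotion classes using the fused codes.
    cls_loss = F.cross_entropy(clf(z_a, z_t), labels)

    # Positive pairs: audio/text codes from the same utterance (same emotion).
    # Negative pairs: text codes shuffled within the batch; shuffled pairs that
    # happen to share a label are still treated as positives.
    perm = torch.randperm(z_t.size(0))
    pos_logits = disc(z_a, z_t)
    neg_logits = disc(z_a, z_t[perm])
    contrastive_loss = (
        F.binary_cross_entropy_with_logits(pos_logits, torch.ones_like(pos_logits))
        + F.binary_cross_entropy_with_logits(neg_logits, (labels == labels[perm]).float())
    )
    return cls_loss + reg_weight * contrastive_loss


if __name__ == "__main__":
    # Toy shapes only: a batch of 32 utterances with 40-dim audio and 300-dim text features.
    enc_a, enc_t = ModalityEncoder(40), ModalityEncoder(300)
    clf, disc = FusionClassifier(), PairDiscriminator()
    loss = training_step(torch.randn(32, 40), torch.randn(32, 300),
                         torch.randint(0, 4, (32,)), enc_a, enc_t, clf, disc)
    print(f"total loss: {loss.item():.4f}")
```

In this reading, the discriminator term pushes the audio and text latent codes of the same utterance to agree on emotional content while keeping mismatched-emotion pairs apart, which is the stated goal of reducing cross-modal noise before fusion.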