Label Uncertainty Modeling and Prediction for Speech Emotion Recognition using t-Distributions

Paper Title

Label Uncertainty Modeling and Prediction for Speech Emotion Recognition using t-Distributions

Authors

Navin Raj Prabhu, Nale Lehmann-Willenbrock, Timo Gerkmann

Abstract

As different people perceive others' emotional expressions differently, their annotations in terms of arousal and valence are per se subjective. To address this, emotion annotations are typically collected from multiple annotators and averaged across annotators to obtain labels for arousal and valence. However, besides the average, the uncertainty of a label is also of interest and should be modeled and predicted for automatic emotion recognition. In the literature, for simplicity, label uncertainty modeling is commonly approached with a Gaussian assumption on the collected annotations. However, as the number of annotators is typically rather small due to resource constraints, we argue that the Gaussian approach is a rather crude assumption. In contrast, in this work we propose to model the label distribution using a Student's t-distribution, which allows us to account for the number of annotations available. With this model, we derive the corresponding Kullback-Leibler divergence-based loss function and use it to train an estimator for the distribution of emotion labels, from which the mean and uncertainty can be inferred. Through qualitative and quantitative analysis, we show the benefits of the t-distribution over a Gaussian distribution. We validate our proposed method on the AVEC'16 dataset. Results reveal that our t-distribution-based approach improves over the Gaussian approach with state-of-the-art uncertainty modeling results in speech-based emotion recognition, along with an optimal and even faster convergence.
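
To illustrate why the number of annotations matters, the following is a minimal sketch comparing a Gaussian and a Student's t-distribution fitted to the same small set of annotations. It is not the paper's loss function or estimator. The annotation values are hypothetical, and setting the degrees of freedom to the number of annotators minus one is an assumption made here for illustration only.

```python
import math


def student_t_logpdf(x, loc, scale, dof):
    """Log-density of a location-scale Student's t-distribution."""
    z = (x - loc) / scale
    return (
        math.lgamma((dof + 1) / 2)
        - math.lgamma(dof / 2)
        - 0.5 * math.log(dof * math.pi)
        - math.log(scale)
        - (dof + 1) / 2 * math.log1p(z * z / dof)
    )


def gaussian_logpdf(x, loc, scale):
    """Log-density of a Gaussian with mean `loc` and std `scale`."""
    z = (x - loc) / scale
    return -0.5 * math.log(2 * math.pi) - math.log(scale) - 0.5 * z * z


# Hypothetical arousal annotations from six annotators for one frame.
annotations = [0.10, 0.25, -0.05, 0.30, 0.15, 0.20]
n = len(annotations)
mean = sum(annotations) / n
std = math.sqrt(sum((a - mean) ** 2 for a in annotations) / (n - 1))

# Assumption for this sketch: degrees of freedom = annotators - 1.
dof = n - 1

# Density three standard deviations from the mean under each model.
far = mean + 3 * std
t_tail = math.exp(student_t_logpdf(far, mean, std, dof))
g_tail = math.exp(gaussian_logpdf(far, mean, std))

print(f"t-distribution density at mean + 3*std: {t_tail:.4f}")
print(f"Gaussian density at mean + 3*std:       {g_tail:.4f}")
```

With few annotators the t-distribution places more mass in the tails than the Gaussian, reflecting the extra uncertainty in a mean and spread estimated from a handful of samples. As the degrees of freedom grow, the two densities coincide.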
