Paper Title
Self-attention fusion for audiovisual emotion recognition with incomplete data
Paper Authors
Paper Abstract
In this paper, we consider the problem of multimodal data analysis with a use case of audiovisual emotion recognition. We propose an architecture capable of learning from raw data and describe three variants of it with distinct modality fusion mechanisms. While most previous works consider the ideal scenario in which both modalities are present at all times during inference, we evaluate the robustness of the model in unconstrained settings where one modality is absent or noisy, and propose a method to mitigate these limitations in the form of modality dropout. Most importantly, we find that following this approach not only improves performance drastically when one modality is absent or its representation is noisy, but also improves performance in the standard ideal setting, outperforming competing methods.
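To make the modality dropout idea concrete, below is a minimal sketch of how such a mechanism is commonly implemented during training: with some probability, one modality's features are replaced by zeros so the fusion model learns not to rely on both streams being present. The function name `modality_dropout`, the drop probability `p_drop`, and the choice to zero out exactly one modality per call are assumptions for illustration; the abstract does not specify the paper's exact scheme.

```python
import torch


def modality_dropout(audio_feats: torch.Tensor,
                     visual_feats: torch.Tensor,
                     p_drop: float = 0.25,
                     training: bool = True):
    """Randomly suppress one modality during training.

    A hypothetical sketch: with probability `p_drop`, one of the two
    modalities (chosen uniformly) is replaced by a zero tensor, which
    simulates the absent-modality condition at inference time.
    """
    if not training:
        # At inference, pass both modalities through unchanged.
        return audio_feats, visual_feats

    if torch.rand(1).item() < p_drop:
        # Drop exactly one of the two modalities, chosen at random.
        if torch.rand(1).item() < 0.5:
            audio_feats = torch.zeros_like(audio_feats)
        else:
            visual_feats = torch.zeros_like(visual_feats)

    return audio_feats, visual_feats
```

In a training loop this would be applied to the per-modality feature tensors just before fusion, so the same fusion network sees complete, audio-only, and video-only inputs over the course of training.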