Paper Title

Continuous Emotion Recognition using Visual-audio-linguistic Information: A Technical Report for ABAW3

Paper Authors

Su Zhang, Ruyi An, Yi Ding, Cuntai Guan

Paper Abstract

We propose a cross-modal co-attention model for continuous emotion recognition using visual-audio-linguistic information. The model consists of four blocks. The visual, audio, and linguistic blocks learn the spatial-temporal features of the multi-modal input. A co-attention block fuses the learned features with a multi-head co-attention mechanism. The visual encoding from the visual block is concatenated with the attention feature to emphasize the visual information. To make full use of the data and alleviate overfitting, cross-validation is carried out on the combined training and validation sets, and concordance correlation coefficient (CCC) centering is used to merge the results from each fold. The achieved CCC on the test set is 0.520 for valence and 0.602 for arousal, which significantly outperforms the baseline method, whose CCCs are 0.180 for valence and 0.170 for arousal. The code is available at https://github.com/sucv/ABAW3.
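
The fusion step described in the abstract (multi-head co-attention over the per-modality encodings, followed by concatenating the visual encoding with the attention feature) can be sketched in PyTorch as below. This is a minimal illustration, not the authors' implementation (see https://github.com/sucv/ABAW3 for that); the feature dimension, head count, and the choice of the audio/linguistic sequences as keys and values are assumptions.

```python
# Minimal sketch of cross-modal co-attention fusion. Dimensions, layer
# names, and wiring are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn

class CoAttentionFusion(nn.Module):
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        # Multi-head co-attention: visual encodings act as queries,
        # audio and linguistic encodings as keys/values.
        self.co_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Regression head over the concatenation of the visual encoding
        # and the attention feature (re-emphasizing visual information).
        self.head = nn.Linear(2 * dim, 2)  # valence, arousal

    def forward(self, visual, audio, linguistic):
        # Each input: (batch, time, dim) features from its modality block.
        context = torch.cat([audio, linguistic], dim=1)       # keys/values
        attended, _ = self.co_attn(visual, context, context)  # (B, T, dim)
        fused = torch.cat([visual, attended], dim=-1)         # (B, T, 2*dim)
        return self.head(fused)                               # (B, T, 2)

# Example: 100-step sequences of 256-d features for a batch of 2.
v = a = l = torch.randn(2, 100, 256)
print(CoAttentionFusion()(v, a, l).shape)  # torch.Size([2, 100, 2])
```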
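
The reported numbers are concordance correlation coefficients, where CCC(x, y) = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2), so a prediction scores highly only if it matches the labels in correlation, variance, and mean. Below is a plain NumPy version of the metric, together with one plausible reading of the "CCC centering" used to merge per-fold predictions (mean-centering each fold's output before averaging); the exact merging procedure is an assumption here, not confirmed by the abstract.

```python
import numpy as np

def ccc(x, y):
    """CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)."""
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    return 2 * cov / (x.var() + y.var() + (mx - my) ** 2)

def merge_folds(preds):
    """One plausible 'CCC centering' merge (an assumption): mean-center
    each fold's prediction sequence, average the centered sequences,
    then restore the grand mean."""
    preds = np.stack(preds)                               # (folds, time)
    centered = preds - preds.mean(axis=1, keepdims=True)
    return centered.mean(axis=0) + preds.mean()
```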
