Paper Title

On the Use of Modality-Specific Large-Scale Pre-Trained Encoders for Multimodal Sentiment Analysis

Authors

Atsushi Ando, Ryo Masumura, Akihiko Takashima, Satoshi Suzuki, Naoki Makishima, Keita Suzuki, Takafumi Moriya, Takanori Ashihara, Hiroshi Sato

Abstract

This paper investigates the effectiveness and implementation of modality-specific large-scale pre-trained encoders for multimodal sentiment analysis (MSA). Although the effectiveness of pre-trained encoders has been reported in various fields, conventional MSA methods employ them only for the linguistic modality, and their application to the other modalities has not been investigated. This paper compares the features yielded by large-scale pre-trained encoders with conventional heuristic features. For each modality, one of the largest publicly available pre-trained encoders is used: CLIP-ViT, WavLM, and BERT for the visual, acoustic, and linguistic modalities, respectively. Experiments on two datasets reveal that methods with domain-specific pre-trained encoders attain better performance than those with conventional features in both unimodal and multimodal scenarios. We also find that using the outputs of the intermediate layers of the encoders works better than using those of the output layer. The code is available at https://github.com/ando-hub/MSA_Pretrain.
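The abstract's key practical finding, that intermediate-layer outputs beat the final-layer output as features, can be illustrated with any transformer encoder that exposes its hidden states. Below is a minimal sketch assuming the HuggingFace Transformers library and the bert-base-uncased checkpoint; the checkpoint name, the pooling, and the layer index are illustrative assumptions, not the authors' exact configuration, which is in the repository linked above. WavLM and CLIP-ViT can be handled the same way via WavLMModel and CLIPVisionModel.

```python
import torch
from transformers import BertModel, BertTokenizer

# Load a pre-trained linguistic encoder (checkpoint choice is an assumption;
# the paper's setup is in https://github.com/ando-hub/MSA_Pretrain).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")
encoder.eval()

inputs = tokenizer("this movie was surprisingly good", return_tensors="pt")

with torch.no_grad():
    # output_hidden_states=True returns the embedding output plus the output
    # of every transformer layer, so intermediate layers are accessible.
    outputs = encoder(**inputs, output_hidden_states=True)

# outputs.hidden_states is a tuple of (num_layers + 1) tensors, each of
# shape (batch, seq_len, hidden_dim). Index an intermediate layer instead
# of the last one (-1); which layer works best is data-dependent, so the
# value here is a hypothetical choice one would tune on a validation set.
intermediate_layer = 8
features = outputs.hidden_states[intermediate_layer].mean(dim=1)
print(features.shape)  # torch.Size([1, 768])
```

The mean over the sequence dimension is one simple way to pool token-level hidden states into a fixed-size utterance feature; the downstream sentiment classifier then consumes these per-modality features.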
