Paper Title

Does a Technique for Building Multimodal Representation Matter? -- Comparative Analysis

Paper Authors

Pawłowski, Maciej, Wróblewska, Anna, Sysko-Romańczuk, Sylwia

Paper Abstract

Creating a meaningful representation by fusing single modalities (e.g., text, images, or audio) is the core concept of multimodal learning. Although several techniques for building multimodal representations have proven successful, they have not been compared yet. Therefore, it has been ambiguous which technique can be expected to yield the best results in a given scenario and what factors should be considered when choosing such a technique. This paper explores the most common techniques for building multimodal data representations -- late fusion, early fusion, and the sketch -- and compares them on classification tasks. Experiments are conducted on three datasets: Amazon Reviews, MovieLens25M, and MovieLens1M. In general, our results confirm that multimodal representations are able to boost the performance of unimodal models from 0.919 to 0.969 accuracy on Amazon Reviews and from 0.907 to 0.918 AUC on MovieLens25M. However, experiments on both MovieLens datasets indicate the importance of input data that is meaningful for the given task. In this article, we show that the choice of technique for building a multimodal representation is crucial to obtaining the highest possible model performance, which comes with the proper combination of modalities. This choice depends on: the influence each modality has on the analyzed machine learning (ML) problem; the type of ML task; and memory constraints during the training and prediction phases.
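The abstract contrasts early and late fusion of modalities. As a minimal illustrative sketch (not the paper's code), assuming toy random features and a placeholder per-modality scorer, the two strategies can be contrasted as follows: early fusion concatenates raw modality features before a single model, while late fusion combines the outputs of separate per-modality models.

```python
import numpy as np

rng = np.random.default_rng(0)
text_feat = rng.random((4, 8))    # 4 samples, 8-dim text embeddings (toy data)
image_feat = rng.random((4, 16))  # 4 samples, 16-dim image embeddings (toy data)

# Early fusion: concatenate modality features into one joint vector,
# which a single downstream model would consume.
early = np.concatenate([text_feat, image_feat], axis=1)  # shape (4, 24)

def toy_model(x):
    # Hypothetical placeholder scorer: squashes the mean feature into (0, 1).
    return 1.0 / (1.0 + np.exp(-x.mean(axis=1)))

# Late fusion: score each modality independently, then combine the
# predictions, e.g. by averaging the per-modality probabilities.
late = (toy_model(text_feat) + toy_model(image_feat)) / 2  # shape (4,)
```

The memory trade-off mentioned in the abstract is visible even here: early fusion must hold the full concatenated feature matrix, whereas late fusion only passes low-dimensional scores between models.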
