Paper Title


On the Benefits of Early Fusion in Multimodal Representation Learning

Authors

George Barnum, Sabera Talukder, Yisong Yue

Abstract


Intelligently reasoning about the world often requires integrating data from multiple modalities, as any individual modality may contain unreliable or incomplete information. Prior work in multimodal learning fuses input modalities only after significant independent processing. On the other hand, the brain performs multimodal processing almost immediately. This divide between conventional multimodal learning and neuroscience suggests that a detailed study of early multimodal fusion could improve artificial multimodal representations. To facilitate the study of early multimodal fusion, we create a convolutional LSTM network architecture that simultaneously processes both audio and visual inputs, and allows us to select the layer at which audio and visual information combines. Our results demonstrate that immediate fusion of audio and visual inputs in the initial C-LSTM layer results in higher performing networks that are more robust to the addition of white noise in both audio and visual inputs.
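The early-fusion idea described in the abstract can be sketched in a few lines: audio and visual feature maps are concatenated along the channel axis *before* the first recurrent layer, so the initial C-LSTM cell already sees joint audio-visual input. The sketch below is an illustrative NumPy toy, not the authors' implementation; the tensor shapes, random gate weights, and the 1x1 (channel-mixing only) convolution are all invented for demonstration, whereas the paper's network uses learned spatial convolution kernels.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def convlstm_step(x, h, c, W_gates):
    """One step of a simplified ConvLSTM cell using a 1x1 convolution
    (pure channel mixing; a real C-LSTM uses spatial kernels)."""
    z = np.concatenate([x, h], axis=0)            # (C_in + C_hid, H, W)
    gates = np.einsum('oc,chw->ohw', W_gates, z)  # (4 * C_hid, H, W)
    i, f, o, g = np.split(gates, 4, axis=0)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new

# Toy input sequences: shapes are arbitrary choices for the sketch.
T, H, W = 4, 8, 8
audio = rng.standard_normal((T, 1, H, W))  # e.g. log-spectrogram frames
video = rng.standard_normal((T, 3, H, W))  # RGB frames

# Early fusion: concatenate modalities along the channel axis so the
# very first recurrent layer processes joint audio-visual input.
fused = np.concatenate([audio, video], axis=1)  # (T, 4, H, W)

C_in, C_hid = fused.shape[1], 6
W_gates = rng.standard_normal((4 * C_hid, C_in + C_hid)) * 0.1
h = np.zeros((C_hid, H, W))
c = np.zeros((C_hid, H, W))
for t in range(T):
    h, c = convlstm_step(fused[t], h, c, W_gates)

# h now carries a joint audio-visual representation from layer 1 onward.
print(h.shape)
```

Late fusion, by contrast, would run separate recurrent stacks over `audio` and `video` and only concatenate their hidden states near the output; the paper's finding is that moving the concatenation to the input, as above, yields higher accuracy and more robustness to white noise in either modality.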
