Title

Analyzing Unaligned Multimodal Sequence via Graph Convolution and Graph Pooling Fusion

Authors

Sijie Mai, Songlong Xing, Jiaxuan He, Ying Zeng, Haifeng Hu

Abstract

In this paper, we study the task of multimodal sequence analysis which aims to draw inferences from visual, language and acoustic sequences. A majority of existing works generally focus on aligned fusion, mostly at word level, of the three modalities to accomplish this task, which is impractical in real-world scenarios. To overcome this issue, we seek to address the task of multimodal sequence analysis on unaligned modality sequences which is still relatively underexplored and also more challenging. Recurrent neural network (RNN) and its variants are widely used in multimodal sequence analysis, but they are susceptible to the issues of gradient vanishing/explosion and high time complexity due to its recurrent nature. Therefore, we propose a novel model, termed Multimodal Graph, to investigate the effectiveness of graph neural networks (GNN) on modeling multimodal sequential data. The graph-based structure enables parallel computation in time dimension and can learn longer temporal dependency in long unaligned sequences. Specifically, our Multimodal Graph is hierarchically structured to cater to two stages, i.e., intra- and inter-modal dynamics learning. For the first stage, a graph convolutional network is employed for each modality to learn intra-modal dynamics. In the second stage, given that the multimodal sequences are unaligned, the commonly considered word-level fusion does not pertain. To this end, we devise a graph pooling fusion network to automatically learn the associations between various nodes from different modalities. Additionally, we define multiple ways to construct the adjacency matrix for sequential data. Experimental results suggest that our graph-based model reaches state-of-the-art performance on two benchmark datasets.
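The abstract mentions defining multiple ways to construct an adjacency matrix over sequential data, followed by graph convolution to learn intra-modal dynamics. As a rough illustration only (the paper's exact constructions and layer definitions are not given here), the sketch below builds one plausible adjacency matrix by linking each time step to its neighbors within a fixed window, then applies a standard symmetrically normalized graph-convolution step; the function names, window scheme, and dimensions are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sequential_adjacency(n_nodes, window=2):
    """One plausible adjacency construction for a temporal sequence:
    connect each node (time step) to neighbors within a fixed window,
    including a self-loop. The paper defines several alternatives."""
    A = np.zeros((n_nodes, n_nodes))
    for i in range(n_nodes):
        for j in range(max(0, i - window), min(n_nodes, i + window + 1)):
            A[i, j] = 1.0
    return A

def gcn_layer(X, A, W):
    """A generic graph-convolution step with symmetric normalization,
    H = ReLU(D^{-1/2} A D^{-1/2} X W), in the style of Kipf & Welling."""
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    return np.maximum(D_inv_sqrt @ A @ D_inv_sqrt @ X @ W, 0.0)

# Toy example: a 5-step unimodal sequence with 4-dim features,
# projected to 3 output channels (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))   # node features, one row per time step
W = rng.standard_normal((4, 3))   # learnable projection (random here)
A = sequential_adjacency(5, window=1)
H = gcn_layer(X, A, W)
print(H.shape)  # (5, 3): one updated representation per time step
```

Because every node's update depends only on matrix products over the whole sequence, all time steps are processed in one shot, which is the parallelism advantage over recurrent models that the abstract highlights.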
