Clover: Towards A Unified Video-Language Alignment and Fusion Model

Paper title

Clover: Towards A Unified Video-Language Alignment and Fusion Model

Authors

Jingjia Huang, Yinan Li, Jiashi Feng, Xinglong Wu, Xiaoshuai Sun, Rongrong Ji

Abstract

Building a universal video-language model that solves a range of video understanding tasks (e.g., text-video retrieval and video question answering) is an open challenge for the machine learning field. Towards this goal, most recent works build the model by stacking uni-modal and cross-modal feature encoders and train it with pair-wise contrastive pretext tasks. Though they offer attractive generality, the resulting models have to compromise between efficiency and performance, and they mostly adopt different architectures for different downstream tasks. We find this is because pair-wise training cannot align and fuse features from different modalities well. We then introduce Clover, a Correlated Video-Language pre-training method, towards a universal video-language model that solves multiple video understanding tasks without compromising either performance or efficiency. It improves cross-modal feature alignment and fusion via a novel tri-modal alignment pre-training task. Additionally, we propose to enhance the tri-modal alignment by incorporating learning from semantic masked samples and a new pair-wise ranking loss. Clover establishes new state-of-the-art results on multiple downstream tasks, including three retrieval tasks in both zero-shot and fine-tuning settings and eight video question answering tasks. Code and pre-trained models will be released at https://github.com/LeeYN-43/Clover.
