Paper Title

Text-driven Video Prediction

Authors

Xue Song, Jingjing Chen, Bin Zhu, Yu-Gang Jiang

Abstract


Current video generation models usually convert signals indicating appearance and motion received from inputs (e.g., image, text) or latent spaces (e.g., noise vectors) into consecutive frames, fulfilling a stochastic generation process for the uncertainty introduced by latent code sampling. However, this generation pattern lacks deterministic constraints for both appearance and motion, leading to uncontrollable and undesirable outcomes. To this end, we propose a new task called Text-driven Video Prediction (TVP). Taking the first frame and a text caption as inputs, this task aims to synthesize the following frames. Specifically, the appearance and motion components are provided by the image and the caption, respectively. The key to addressing the TVP task lies in fully exploring the underlying motion information in text descriptions, thus facilitating plausible video generation. In fact, this task is intrinsically a cause-and-effect problem, as the text content directly influences the motion changes of frames. To investigate the capability of text in causal inference for progressive motion information, our TVP framework contains a Text Inference Module (TIM), producing step-wise embeddings to regulate motion inference for subsequent frames. In particular, a refinement mechanism incorporating global motion semantics guarantees coherent generation. Extensive experiments are conducted on the Something-Something V2 and Single Moving MNIST datasets. Experimental results demonstrate that our model outperforms other baselines, verifying the effectiveness of the proposed framework.
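The abstract does not specify the internals of the Text Inference Module, but the core idea — unrolling a single caption embedding into a sequence of step-wise motion embeddings, one per future frame — can be illustrated with a minimal recurrence. The sketch below is purely hypothetical: the weights, dimensions, and the tanh update rule are stand-ins for whatever learned architecture the paper actually uses, shown only to make the "step-wise embeddings regulate motion inference for subsequent frames" idea concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

def step_wise_embeddings(text_emb, num_steps, dim):
    """Hypothetical sketch: unroll a simple recurrence over a fixed
    caption embedding to produce one motion embedding per future frame.
    (Not the paper's actual TIM; weights here are random stand-ins
    for learned parameters.)"""
    w_h = rng.standard_normal((dim, dim)) * 0.1            # hidden-to-hidden
    w_x = rng.standard_normal((dim, text_emb.shape[0])) * 0.1  # text-to-hidden
    h = np.zeros(dim)
    steps = []
    for _ in range(num_steps):
        # Each step re-injects the same text signal into an evolving
        # hidden state, yielding progressively changing motion embeddings
        # that could condition the generation of frame t+1.
        h = np.tanh(w_h @ h + w_x @ text_emb)
        steps.append(h.copy())
    return np.stack(steps)  # shape: (num_steps, dim)

text_emb = rng.standard_normal(16)  # toy caption embedding
embs = step_wise_embeddings(text_emb, num_steps=5, dim=8)
print(embs.shape)
```

Each row of `embs` would then be fed to the frame decoder as the motion condition for the corresponding step, while the first frame supplies appearance.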
