Paper Title
MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation
Paper Authors
Paper Abstract
Video prediction is a challenging task. The quality of video frames from current state-of-the-art (SOTA) generative models tends to be poor and generalization beyond the training data is difficult. Furthermore, existing prediction frameworks are typically not capable of simultaneously handling other video-related tasks such as unconditional generation or interpolation. In this work, we devise a general-purpose framework called Masked Conditional Video Diffusion (MCVD) for all of these video synthesis tasks using a probabilistic conditional score-based denoising diffusion model, conditioned on past and/or future frames. We train the model in a manner where we randomly and independently mask all the past frames or all the future frames. This novel but straightforward setup allows us to train a single model that is capable of executing a broad range of video tasks, specifically: future/past prediction -- when only future/past frames are masked; unconditional generation -- when both past and future frames are masked; and interpolation -- when neither past nor future frames are masked. Our experiments show that this approach can generate high-quality frames for diverse types of videos. Our MCVD models are built from simple non-recurrent 2D-convolutional architectures, conditioning on blocks of frames and generating blocks of frames. We generate videos of arbitrary lengths autoregressively in a block-wise manner. Our approach yields SOTA results across standard video prediction and interpolation benchmarks, with computation times for training models measured in 1-12 days using $\le$ 4 GPUs. Project page: https://mask-cond-video-diffusion.github.io; Code: https://github.com/voletiv/mcvd-pytorch
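To make the masked-conditioning idea concrete, here is a minimal PyTorch-style sketch, not the authors' implementation: the function name `mask_conditioning`, the tensor shapes, and the mask probability `p_mask` are illustrative assumptions. It shows how the past and future conditioning frame blocks could be masked randomly and independently during training, so that a single diffusion model covers prediction, unconditional generation, and interpolation depending on which blocks are masked at sampling time.

```python
# Minimal sketch of masked conditioning (assumed names/shapes, not the official MCVD code).
import torch

def mask_conditioning(past, future, p_mask=0.5):
    """Randomly and independently mask the past and/or future frame blocks.

    past, future: conditioning frame blocks of shape (B, T, C, H, W).
    Returns the masked blocks plus per-sample keep flags.
    """
    B = past.shape[0]
    keep_past = (torch.rand(B) > p_mask).float().view(B, 1, 1, 1, 1)
    keep_future = (torch.rand(B) > p_mask).float().view(B, 1, 1, 1, 1)
    return past * keep_past, future * keep_future, keep_past, keep_future

# Toy usage: batch of 4 clips, 2 past frames, 2 future frames, 64x64 RGB.
past = torch.randn(4, 2, 3, 64, 64)
future = torch.randn(4, 2, 3, 64, 64)
masked_past, masked_future, kp, kf = mask_conditioning(past, future)

# At sampling time, the same trained model handles each task by fixing the masks:
#   future prediction        -> keep past,  mask future
#   past prediction          -> mask past,  keep future
#   unconditional generation -> mask both
#   interpolation            -> keep both
```

In this reading, the masked blocks (together with the keep flags) are fed as conditioning to the denoising network, and longer videos are produced block-wise by autoregressively conditioning each new block on previously generated frames.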