Paper title
Long Movie Clip Classification with State-Space Video Models
Paper authors
Abstract
Most modern video recognition models are designed to operate on short video clips (e.g., 5-10s in length). Thus, it is challenging to apply such models to long movie understanding tasks, which typically require sophisticated long-range temporal reasoning. The recently introduced video transformers partially address this issue by using long-range temporal self-attention. However, due to the quadratic cost of self-attention, such models are often costly and impractical to use. Instead, we propose ViS4mer, an efficient long-range video model that combines the strengths of self-attention and the recently introduced structured state-space sequence (S4) layer. Our model uses a standard Transformer encoder for short-range spatiotemporal feature extraction, and a multi-scale temporal S4 decoder for subsequent long-range temporal reasoning. By progressively reducing the spatiotemporal feature resolution and channel dimension at each decoder layer, ViS4mer learns complex long-range spatiotemporal dependencies in a video. Furthermore, ViS4mer is $2.63\times$ faster and requires $8\times$ less GPU memory than the corresponding pure self-attention-based model. Additionally, ViS4mer achieves state-of-the-art results in $6$ out of $9$ long-form movie video classification tasks on the Long Video Understanding (LVU) benchmark. Furthermore, we show that our approach successfully generalizes to other domains, achieving competitive results on the Breakfast and the COIN procedural activity datasets. The code is publicly available at: https://github.com/md-mohaiminul/ViS4mer.
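The abstract's two efficiency arguments, the quadratic cost of self-attention and the progressive reduction of feature resolution and channel dimension in the decoder, can be made concrete with a small shape calculation. The following is a minimal sketch and not the authors' implementation: the function names `decoder_shapes` and `attention_pairs`, the starting size of 4096 tokens with 1024 channels, the three decoder layers, and the reduction factor of 2 per layer are all illustrative assumptions that do not come from the abstract.

```python
def decoder_shapes(num_tokens, channels, num_layers, pool=2, channel_div=2):
    """Return the (tokens, channels) shape at the decoder input and after each layer.

    Hypothetical helper: `pool` and `channel_div` are assumed reduction
    factors, chosen only to illustrate a multi-scale decoder.
    """
    shapes = [(num_tokens, channels)]
    for _ in range(num_layers):
        # Each layer shortens the sequence and narrows the feature width.
        num_tokens = max(1, num_tokens // pool)
        channels = max(1, channels // channel_div)
        shapes.append((num_tokens, channels))
    return shapes


def attention_pairs(num_tokens):
    """Count the token pairs scored by full self-attention (quadratic in length)."""
    return num_tokens * num_tokens


# Assumed sizes for illustration only.
shapes = decoder_shapes(num_tokens=4096, channels=1024, num_layers=3)
for layer, (tokens, width) in enumerate(shapes):
    print(f"layer {layer}: tokens={tokens}, channels={width}, "
          f"attention pairs at this length={attention_pairs(tokens)}")
```

Under these assumptions, doubling the sequence length quadruples the number of pairs that full self-attention has to score, which is the cost the abstract attributes to pure self-attention models. The shrinking `(tokens, channels)` shapes show how a multi-scale decoder reduces the work done at each successive layer.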