Paper Title
Two-Stream Transformer Architecture for Long Video Understanding
Paper Authors
Paper Abstract
Pure vision transformer architectures are highly effective for short video classification and action recognition tasks. However, due to the quadratic complexity of self-attention and the lack of inductive bias, transformers are resource-intensive and suffer from data inefficiency. Long-form video understanding tasks amplify the data and memory efficiency problems of transformers, making current approaches infeasible to implement in data- or memory-restricted domains. This paper introduces an efficient Spatio-Temporal Attention Network (STAN), which uses a two-stream transformer architecture to model dependencies between static image features and temporal contextual features. Our proposed approach can classify videos up to two minutes in length on a single GPU, is data efficient, and achieves SOTA performance on several long video understanding tasks.
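The core idea in the abstract, one stream of temporal context features modeling dependencies on a second stream of static per-frame image features, can be illustrated with a minimal single-head cross-attention step. This is a hedged sketch, not the paper's implementation: the feature dimensions, the NumPy formulation, and the choice of temporal features as queries over spatial keys/values are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # Scaled dot-product attention: one stream's queries attend
    # to the other stream's keys/values, fusing the two streams.
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)    # (T_q, T_k) cross-stream affinities
    return softmax(scores, axis=-1) @ values  # (T_q, d) fused features

rng = np.random.default_rng(0)
T, d = 8, 16  # toy sizes: 8 frames, 16-dim features (assumed for illustration)
spatial = rng.standard_normal((T, d))   # stands in for static image features
temporal = rng.standard_normal((T, d))  # stands in for temporal context features

# Temporal stream queries the spatial stream, one attention step per layer
fused = cross_attention(temporal, spatial, spatial)
print(fused.shape)  # (8, 16)
```

Note that each cross-attention step costs O(T_q · T_k) rather than quadratic in the full token count of a monolithic transformer, which is one way a two-stream split can reduce the memory footprint on long videos.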