Paper Title
TransVOD: End-to-End Video Object Detection with Spatial-Temporal Transformers
Paper Authors
Paper Abstract
Detection Transformer (DETR) and Deformable DETR have been proposed to eliminate the need for many hand-designed components in object detection while demonstrating performance comparable to previous complex hand-crafted detectors. However, their performance on Video Object Detection (VOD) has not been well explored. In this paper, we present TransVOD, the first end-to-end video object detection system based on spatial-temporal Transformer architectures. The first goal of this paper is to streamline the pipeline of VOD, effectively removing the need for many hand-crafted components for feature aggregation, e.g., optical flow models and relation networks. Besides, benefiting from the object query design in DETR, our method does not need complicated post-processing methods such as Seq-NMS. In particular, we present a temporal Transformer to aggregate both the spatial object queries and the feature memories of each frame. Our temporal Transformer consists of two components: a Temporal Query Encoder (TQE) to fuse object queries, and a Temporal Deformable Transformer Decoder (TDTD) to obtain current-frame detection results. These designs boost the strong Deformable DETR baseline by a significant margin (3%-4% mAP) on the ImageNet VID dataset. We then present two improved versions of TransVOD: TransVOD++ and TransVOD Lite. The former fuses object-level information into the object queries via dynamic convolution, while the latter models an entire video clip as a single input-output unit to speed up inference. We give a detailed analysis of all three models in the experiments section. In particular, our proposed TransVOD++ sets a new state-of-the-art record in accuracy on ImageNet VID with 90.0% mAP. Our proposed TransVOD Lite also achieves the best speed-accuracy trade-off, with 83.7% mAP while running at around 30 FPS on a single V100 GPU.
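The core idea behind the Temporal Query Encoder — letting per-frame object queries exchange information across frames before decoding the current frame — can be illustrated with a minimal sketch. Note this is an illustrative assumption, not the paper's implementation: the real TQE uses multi-head Transformer encoder layers, whereas this sketch uses a single round of plain scaled dot-product self-attention, and the function names (`temporal_query_fusion`, `self_attention`) are hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(vectors):
    """One round of scaled dot-product self-attention over a set of vectors.

    Each vector attends to all vectors (including itself) and is replaced
    by the attention-weighted average.
    """
    d = len(vectors[0])
    out = []
    for q in vectors:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in vectors])
        out.append([sum(w * v[i] for w, v in zip(weights, vectors))
                    for i in range(d)])
    return out

def temporal_query_fusion(frame_queries):
    """Hypothetical TQE-style fusion of object queries across frames.

    frame_queries: list of frames, each a list of query vectors
    (the last frame is treated as the current frame). Queries from all
    frames are concatenated, mixed via self-attention so that temporal
    context flows between frames, and the fused queries belonging to the
    current frame are returned for decoding.
    """
    all_q = [q for frame in frame_queries for q in frame]
    fused = self_attention(all_q)
    n = len(frame_queries[-1])
    return fused[-n:]
```

In the full model, the fused queries would then be handed to the temporal decoder (TDTD) together with the aggregated feature memories to produce current-frame detections; this sketch only shows the query-side mixing step.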