Paper Title
Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking
Paper Authors
Paper Abstract
The recent trend in multiple object tracking (MOT) is heading towards leveraging deep learning to boost the tracking performance. In this paper, we propose a novel solution named TransSTAM, which leverages Transformer to effectively model both the appearance features of each object and the spatial-temporal relationships among objects. TransSTAM consists of two major parts: (1) The encoder utilizes the powerful self-attention mechanism of Transformer to learn discriminative features for each tracklet; (2) The decoder adopts the standard cross-attention mechanism to model the affinities between the tracklets and the detections by taking both spatial-temporal and appearance features into account. TransSTAM has two major advantages: (1) It is solely based on the encoder-decoder architecture and enjoys a compact network design, hence being computationally efficient; (2) It can effectively learn spatial-temporal and appearance features within one model, hence achieving better tracking accuracy. The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA with respect to previous state-of-the-art approaches on all the benchmarks. Our code is available at \url{https://github.com/icicle4/TranSTAM}.
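Below is a minimal, hypothetical sketch of the encoder-decoder affinity model the abstract describes, built from standard PyTorch Transformer modules: a self-attention encoder that summarizes each tracklet into a discriminative embedding, and a cross-attention decoder that scores affinities between tracklets and current-frame detections. Module names (TrackletEncoder, AffinityDecoder), dimensions, the mean-pooling step, and the way spatial-temporal and appearance cues are fused into a single feature vector are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a tracklet encoder + affinity decoder, assuming
# per-frame features that already combine appearance and spatial-temporal cues.
import torch
import torch.nn as nn


class TrackletEncoder(nn.Module):
    """Self-attention over the per-frame features of each tracklet."""

    def __init__(self, d_model: int = 256, nhead: int = 8, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, tracklet_feats: torch.Tensor) -> torch.Tensor:
        # tracklet_feats: (num_tracklets, history_len, d_model)
        # Return one embedding per tracklet (mean-pooled over its history).
        return self.encoder(tracklet_feats).mean(dim=1)


class AffinityDecoder(nn.Module):
    """Cross-attention between tracklet embeddings and detection features."""

    def __init__(self, d_model: int = 256, nhead: int = 8, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.affinity_head = nn.Linear(d_model, d_model)

    def forward(self, tracklet_emb: torch.Tensor, det_feats: torch.Tensor) -> torch.Tensor:
        # tracklet_emb: (num_tracklets, d_model); det_feats: (num_dets, d_model)
        tgt = tracklet_emb.unsqueeze(0)    # (1, num_tracklets, d_model)
        memory = det_feats.unsqueeze(0)    # (1, num_dets, d_model)
        refined = self.decoder(tgt, memory).squeeze(0)
        # Affinity matrix: one score per (tracklet, detection) pair.
        return self.affinity_head(refined) @ det_feats.t()


if __name__ == "__main__":
    enc, dec = TrackletEncoder(), AffinityDecoder()
    tracklets = torch.randn(5, 10, 256)   # 5 tracklets, 10 past frames each
    detections = torch.randn(7, 256)      # 7 detections in the current frame
    affinity = dec(enc(tracklets), detections)
    print(affinity.shape)                 # torch.Size([5, 7])
```

In such a setup the resulting affinity matrix would typically be fed to an assignment step (e.g. the Hungarian algorithm) to match tracklets with detections frame by frame; that step is outside the scope of this sketch.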