Paper Title
Generalizing Multiple Object Tracking to Unseen Domains by Introducing Natural Language Representation
Paper Authors
Paper Abstract
Although existing multi-object tracking (MOT) algorithms have obtained competitive performance on various benchmarks, almost all of them train and validate models on the same domain. The domain generalization problem of MOT is hardly studied. To bridge this gap, we first draw the observation that the high-level information contained in natural language is invariant across different tracking domains. Based on this observation, we propose to introduce natural language representation into visual MOT models to boost their domain generalization ability. However, it is infeasible to label every tracking target with a textual description. To tackle this problem, we design two modules, namely visual context prompting (VCP) and visual-language mixing (VLM). Specifically, VCP generates visual prompts based on the input frames. VLM fuses the information in the generated visual prompts with the textual prompts from a pre-defined Trackbook to obtain instance-level pseudo textual descriptions, which are invariant across different tracking scenes. By training models on MOT17 and validating them on MOT20, we observe that the pseudo textual descriptions generated by our proposed modules improve the generalization performance of query-based trackers by large margins.
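To make the VCP/VLM data flow concrete, here is a minimal toy sketch of the pipeline the abstract describes. Everything in it is an illustrative assumption, not the paper's actual architecture: the feature dimensions, the mean-pooling used for VCP, and the attention-plus-blend rule used for VLM are all placeholders chosen only to show how visual prompts and Trackbook text prompts could combine into instance-level pseudo textual descriptions.

```python
import numpy as np

def visual_context_prompt(frame_feats, num_prompts=4):
    """Hypothetical VCP: pool per-frame patch features into a few visual prompts."""
    # frame_feats: (N, D) patch features extracted from one input frame
    chunks = np.array_split(frame_feats, num_prompts, axis=0)
    return np.stack([c.mean(axis=0) for c in chunks])        # (num_prompts, D)

def visual_language_mixing(visual_prompts, text_prompts, alpha=0.5):
    """Hypothetical VLM: attend each text prompt over the visual prompts,
    then blend the attended visual summary back into the text prompt."""
    sim = text_prompts @ visual_prompts.T                    # (T, P) similarities
    attn = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)
    visual_summary = attn @ visual_prompts                   # (T, D)
    # pseudo textual description: a mix of language and visual context
    return alpha * text_prompts + (1 - alpha) * visual_summary

rng = np.random.default_rng(0)
frame_feats = rng.standard_normal((64, 8))   # 64 patch features of dim 8
text_prompts = rng.standard_normal((3, 8))   # 3 Trackbook text prompts

pseudo_desc = visual_language_mixing(visual_context_prompt(frame_feats),
                                     text_prompts)
print(pseudo_desc.shape)  # (3, 8): one pseudo description per text prompt
```

In the real model these descriptions would condition the queries of a query-based tracker; the sketch only shows that the output lives in the same space as the text prompts, which is what lets it stay domain-invariant.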