Paper Title


End-to-End Modeling via Information Tree for One-Shot Natural Language Spatial Video Grounding

Paper Authors

Mengze Li, Tianbao Wang, Haoyu Zhang, Shengyu Zhang, Zhou Zhao, Jiaxu Miao, Wenqiao Zhang, Wenming Tan, Jin Wang, Peng Wang, Shiliang Pu, Fei Wu

Paper Abstract

Natural language spatial video grounding aims to detect the relevant objects in video frames with descriptive sentences as the query. In spite of the great advances, most existing methods rely on dense video frame annotations, which require a tremendous amount of human effort. To achieve effective grounding under a limited annotation budget, we investigate one-shot video grounding and learn to ground natural language in all video frames with solely one frame labeled, in an end-to-end manner. One major challenge of end-to-end one-shot video grounding is the existence of video frames that are irrelevant to either the language query or the labeled frame. Another challenge relates to the limited supervision, which might result in ineffective representation learning. To address these challenges, we design an end-to-end model via Information Tree for One-Shot video grounding (IT-OS). Its key module, the information tree, can eliminate the interference of irrelevant frames based on branch search and branch cropping techniques. In addition, several self-supervised tasks are proposed based on the information tree to improve the representation learning under insufficient labeling. Experiments on the benchmark dataset demonstrate the effectiveness of our model.
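To make the branch-search and branch-cropping idea concrete, the following is a hypothetical pure-Python sketch, not the paper's actual implementation: it builds an implicit binary tree over a sequence of frame feature vectors, scores each node (a span of frames) against the query embedding with cosine similarity, crops branches that score below a threshold, and keeps the surviving leaf frames. The function names, the mean-pooled node features, the cosine score, and the threshold value are all illustrative assumptions.

```python
import math

def cosine(a, b):
    # Cosine similarity between two feature vectors (lists of floats).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb + 1e-8)

def mean(vectors):
    # Mean-pool a list of equal-length vectors into one node feature.
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def branch_search(frames, query, lo, hi, threshold, keep):
    """Score the frame span [lo, hi); crop branches scoring below threshold."""
    if cosine(mean(frames[lo:hi]), query) < threshold:
        return            # "branch cropping": discard the whole span at once
    if hi - lo == 1:
        keep.append(lo)   # surviving leaf frame
        return
    mid = (lo + hi) // 2  # "branch search": descend into both children
    branch_search(frames, query, lo, mid, threshold, keep)
    branch_search(frames, query, mid, hi, threshold, keep)

# Toy example: only frame 2 points in the query direction.
frames = [[0, 1, 0, 0], [0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 0, 1]]
query = [1, 0, 0, 0]
keep = []
branch_search(frames, query, 0, len(frames), threshold=0.3, keep=keep)
print(keep)  # -> [2]
```

Because an irrelevant span is rejected at the node level, whole subtrees of irrelevant frames are skipped without scoring each frame individually, which is the efficiency argument behind operating on a tree rather than on the flat frame sequence.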
