Paper title

Open-Vocabulary Temporal Action Detection with Off-the-Shelf Image-Text Features

Authors

Vivek Rathod, Bryan Seybold, Sudheendra Vijayanarasimhan, Austin Myers, Xiuye Gu, Vighnesh Birodkar, David A. Ross

Abstract

Detecting actions in untrimmed videos should not be limited to a small, closed set of classes. We present a simple yet effective strategy for open-vocabulary temporal action detection utilizing pretrained image-text co-embeddings. Although these co-embeddings are trained on static images rather than videos, we show that they enable open-vocabulary performance competitive with fully supervised models. We show that performance can be further improved by ensembling the image-text features with features encoding local motion, such as optical-flow-based features, or other modalities, such as audio. In addition, we propose a more reasonable open-vocabulary evaluation setting for the ActivityNet dataset, where the category splits are based on similarity rather than random assignment.
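
The abstract does not spell out the detection pipeline, so the following is only a minimal sketch of the general idea of open-vocabulary detection with a shared image-text embedding space, not the authors' method. It assumes per-frame image embeddings and per-class text embeddings are already available as plain vectors; the function names, the threshold, and the merge-consecutive-frames rule are all hypothetical. Ensembling with motion or audio features is not shown.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def detect_segments(frame_embs, text_embs, threshold=0.5):
    """Label each frame with its best-matching class, then merge runs.

    frame_embs: list of per-frame image embeddings (assumed precomputed).
    text_embs: dict mapping class name -> text embedding in the same space.
    Returns a list of (start_frame, end_frame_inclusive, class_name).
    """
    # Step 1: per-frame classification against arbitrary class names.
    labels = []
    for emb in frame_embs:
        best_name, best_score = None, threshold
        for name, text_emb in text_embs.items():
            score = cosine(emb, text_emb)
            if score >= best_score:
                best_name, best_score = name, score
        labels.append(best_name)  # None means "background"

    # Step 2: merge consecutive frames that share a label into segments.
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            if labels[start] is not None:
                segments.append((start, i - 1, labels[start]))
            start = i
    return segments


if __name__ == "__main__":
    # Toy 2-D embeddings; real co-embeddings are much higher-dimensional.
    classes = {"run": [1.0, 0.0], "jump": [0.0, 1.0]}
    frames = [[1.0, 0.0], [0.9, 0.1], [1.0, 1.0], [0.0, 1.0], [0.1, 1.0]]
    print(detect_segments(frames, classes, threshold=0.9))
```

Because the class set is just a dictionary of text embeddings, new action names can be added at inference time without retraining, which is the property that "open-vocabulary" refers to.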
