Paper Title

Distilling Vision-Language Pre-training to Collaborate with Weakly-Supervised Temporal Action Localization

Paper Authors

Chen Ju, Kunhao Zheng, Jinxiang Liu, Peisen Zhao, Ya Zhang, Jianlong Chang, Yanfeng Wang, Qi Tian

Paper Abstract

Weakly-supervised temporal action localization (WTAL) learns to detect and classify action instances with only category labels. Most methods adopt off-the-shelf Classification-Based Pre-training (CBP) to generate video features for action localization. However, the different optimization objectives between classification and localization make the temporal localization results suffer from a serious incompleteness issue. To tackle this issue without additional annotations, this paper considers distilling free action knowledge from Vision-Language Pre-training (VLP), since we surprisingly observe that the localization results of vanilla VLP have an over-completeness issue, which is exactly complementary to the CBP results. To fuse such complementarity, we propose a novel distillation-collaboration framework with two branches acting as CBP and VLP respectively. The framework is optimized through a dual-branch alternate training strategy. Specifically, during the B step, we distill confident background pseudo-labels from the CBP branch; during the F step, confident foreground pseudo-labels are distilled from the VLP branch. As a result, the dual-branch complementarity is effectively fused to promote a strong alliance. Extensive experiments and ablation studies on THUMOS14 and ActivityNet1.2 reveal that our method significantly outperforms state-of-the-art methods.
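To make the dual-branch alternate training concrete, below is a minimal PyTorch-style sketch of how the B/F pseudo-label distillation described in the abstract could look. Everything here is a hypothetical illustration under assumed details, not the authors' released code: the names `cbp_branch` and `vlp_branch`, the confidence thresholds `lo`/`hi`, the binary per-snippet formulation, and the assumption that each step's confident pseudo-labels supervise the opposite branch are all our own reading of the abstract.

```python
import torch
import torch.nn.functional as F

def alternate_step(cbp_branch, vlp_branch, video_feats, step, lo=0.3, hi=0.7):
    """One alternate-training step ("B" or "F") on per-snippet foreground logits.

    Assumes each branch maps video features to snippet-level foreground logits;
    all thresholds and the cross-branch supervision direction are assumptions.
    """
    if step == "B":
        # B step: CBP localizations are incomplete, so its *low* scores are
        # reliable background. Distill confident background pseudo-labels from
        # the frozen CBP branch (assumed here to supervise the VLP branch).
        with torch.no_grad():
            teacher = torch.sigmoid(cbp_branch(video_feats))
        student = torch.sigmoid(vlp_branch(video_feats))
        mask = teacher < lo                    # confident background snippets
        labels = torch.zeros_like(teacher)
    else:
        # F step: VLP localizations are over-complete, so its *high* scores are
        # reliable foreground. Distill confident foreground pseudo-labels from
        # the frozen VLP branch (assumed here to supervise the CBP branch).
        with torch.no_grad():
            teacher = torch.sigmoid(vlp_branch(video_feats))
        student = torch.sigmoid(cbp_branch(video_feats))
        mask = teacher > hi                    # confident foreground snippets
        labels = torch.ones_like(teacher)

    if not mask.any():
        # No confident snippets in this batch: return a zero loss that still
        # participates in the autograd graph.
        return student.sum() * 0.0
    # Supervise only the confident snippets; ambiguous ones are ignored.
    return F.binary_cross_entropy(student[mask], labels[mask])
```

In training, one would alternate B and F steps (e.g., per iteration or per epoch) so that each branch's confident predictions progressively tighten the other's: CBP's trustworthy background suppresses VLP's over-complete proposals, while VLP's trustworthy foreground extends CBP's incomplete ones.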
