Paper Title

ActionFormer: Localizing Moments of Actions with Transformers

Paper Authors

Chenlin Zhang, Jianxin Wu, Yin Li

Paper Abstract

Self-attention based Transformer models have demonstrated impressive results for image classification and object detection, and more recently for video understanding. Inspired by this success, we investigate the application of Transformer networks for temporal action localization in videos. To this end, we present ActionFormer -- a simple yet powerful model to identify actions in time and recognize their categories in a single shot, without using action proposals or relying on pre-defined anchor windows. ActionFormer combines a multiscale feature representation with local self-attention, and uses a light-weighted decoder to classify every moment in time and estimate the corresponding action boundaries. We show that this orchestrated design results in major improvements upon prior works. Without bells and whistles, ActionFormer achieves 71.0% mAP at tIoU=0.5 on THUMOS14, outperforming the best prior model by 14.1 absolute percentage points. Further, ActionFormer demonstrates strong results on ActivityNet 1.3 (36.6% average mAP) and EPIC-Kitchens 100 (+13.5% average mAP over prior works). Our code is available at http://github.com/happyharrycn/actionformer_release.
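To make the abstract's description more concrete, below is a minimal, illustrative PyTorch sketch of the overall idea: a multiscale (pyramid) encoder built from local self-attention blocks over pre-extracted clip features, followed by lightweight shared heads that classify every moment and regress its distances to the action start and end. All module names, the window size, feature dimensions, and pyramid depth are assumptions made for illustration; this is not the released implementation, which is available at the repository linked above.

```python
# Illustrative sketch only: windowed self-attention encoder + per-moment
# classification/regression heads, loosely following the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalSelfAttentionBlock(nn.Module):
    """Self-attention restricted to a local temporal window (assumed size)."""

    def __init__(self, dim, num_heads=4, window=16):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):                      # x: (B, T, C)
        B, T, C = x.shape
        pad = (-T) % self.window
        h = F.pad(x, (0, 0, 0, pad))           # pad time axis to a window multiple
        h = h.reshape(-1, self.window, C)      # attend within each local window
        q = self.norm1(h)
        h = self.attn(q, q, q)[0]
        h = h.reshape(B, T + pad, C)[:, :T]
        x = x + h
        return x + self.mlp(self.norm2(x))


class ActionFormerSketch(nn.Module):
    """Feature pyramid over time with shared classification/regression heads."""

    def __init__(self, in_dim=2304, dim=256, num_classes=20, levels=5):
        super().__init__()
        self.proj = nn.Conv1d(in_dim, dim, 1)
        self.blocks = nn.ModuleList(LocalSelfAttentionBlock(dim) for _ in range(levels))
        self.pool = nn.MaxPool1d(2)            # halve the temporal length per level
        self.cls_head = nn.Conv1d(dim, num_classes, 3, padding=1)
        self.reg_head = nn.Conv1d(dim, 2, 3, padding=1)   # distances to start/end

    def forward(self, feats):                  # feats: (B, in_dim, T) clip features
        x = self.proj(feats)
        cls_logits, offsets = [], []
        for blk in self.blocks:
            x = blk(x.transpose(1, 2)).transpose(1, 2)
            cls_logits.append(self.cls_head(x))             # (B, num_classes, T_l)
            offsets.append(torch.relu(self.reg_head(x)))    # non-negative offsets
            x = self.pool(x)                                # next pyramid level
        return cls_logits, offsets


if __name__ == "__main__":
    model = ActionFormerSketch()
    scores, bounds = model(torch.randn(1, 2304, 256))
    print([s.shape for s in scores])
```

In this reading of the abstract, every moment at every pyramid level directly yields a class score and a (start, end) offset pair, which is what removes the need for action proposals or pre-defined anchor windows; at inference, high-scoring moments would be decoded into segments and merged with non-maximum suppression.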
