Paper Title


Multimodal Prototype-Enhanced Network for Few-Shot Action Recognition

Paper Authors

Xinzhe Ni, Yong Liu, Hao Wen, Yatai Ji, Jing Xiao, Yujiu Yang

Paper Abstract


Current methods for few-shot action recognition mainly fall into the metric learning framework following ProtoNet, which demonstrates the importance of prototypes. Although they achieve relatively good performance, the effect of multimodal information, e.g., label texts, is ignored. In this work, we propose a novel MultimOdal PRototype-ENhanced Network (MORN), which uses the semantic information of label texts as multimodal information to enhance prototypes. A CLIP visual encoder and a frozen CLIP text encoder are introduced to obtain features with good multimodal initialization. In the visual flow, visual prototypes are computed by a visual prototype computation module. In the text flow, a semantic-enhanced (SE) module and an inflating operation are used to obtain text prototypes. The final multimodal prototypes are then computed by a multimodal prototype-enhanced (MPE) module. In addition, we define the PRototype SImilarity DiffErence (PRIDE) to evaluate the quality of prototypes, which is used to verify our improvement at the prototype level and the effectiveness of MORN. We conduct extensive experiments on four popular few-shot action recognition datasets: HMDB51, UCF101, Kinetics, and SSv2, and MORN achieves state-of-the-art results. When plugging PRIDE into the training stage, the performance can be further improved.
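The pipeline the abstract describes can be illustrated with a minimal sketch. Note the assumptions: the ProtoNet-style class-mean prototype is standard, but the fusion in `enhance_prototypes` (a simple convex combination with an assumed weight `alpha`) and the separation score in `pride_score` are hypothetical simplifications of the paper's MPE module and PRIDE definition, not the authors' exact formulations.

```python
import numpy as np

def visual_prototypes(support_feats, labels, n_classes):
    """ProtoNet-style prototypes: mean of support features per class."""
    dim = support_feats.shape[1]
    protos = np.zeros((n_classes, dim))
    for c in range(n_classes):
        protos[c] = support_feats[labels == c].mean(axis=0)
    return protos

def enhance_prototypes(visual_protos, text_feats, alpha=0.5):
    """Hypothetical multimodal enhancement: convex combination of the
    visual prototype and the label-text feature for each class.
    The paper's MPE module is more elaborate; alpha is an assumed weight."""
    return alpha * visual_protos + (1.0 - alpha) * text_feats

def pride_score(protos):
    """Hypothetical PRIDE-style score: average gap between a prototype's
    self-similarity (1.0 after L2-normalization) and its highest cosine
    similarity to any other prototype. Larger means better-separated."""
    p = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    sim = p @ p.T
    np.fill_diagonal(sim, -np.inf)          # ignore self-similarity
    return float(np.mean(1.0 - sim.max(axis=1)))
```

Under this sketch, a higher `pride_score` indicates prototypes that are further apart on the unit sphere, which is the intuition behind using such a measure both as a diagnostic and as a training-stage signal.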
