Paper Title
Students taught by multimodal teachers are superior action recognizers
Paper Authors
Paper Abstract
The focal point of egocentric video understanding is modelling hand-object interactions. Standard models -- CNNs, Vision Transformers, etc. -- which receive RGB frames as input perform well; however, their performance improves further when additional modalities such as object detections, optical flow, and audio are employed as input. On the other hand, the added complexity of the required modality-specific modules makes these models impractical for deployment. The goal of this work is to retain the performance of such multimodal approaches while using only RGB images as input at inference time. Our approach is based on multimodal knowledge distillation, featuring a multimodal teacher (trained, in the current experiments, on object detections, optical flow, and RGB frames) and a unimodal student (using only RGB frames as input). We present preliminary results demonstrating that the resulting model -- distilled from a multimodal teacher -- significantly outperforms the baseline RGB model (trained without knowledge distillation), as well as an omnivorous version of itself (trained jointly on all modalities), in both standard and compositional action recognition.
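To make the teacher-student setup concrete, below is a minimal sketch (in PyTorch) of how such a multimodal-to-RGB distillation objective is commonly implemented: a cross-entropy term on the ground-truth labels plus a KL term matching the student's softened predictions to those of a frozen multimodal teacher. The function name, temperature, and loss weighting are illustrative assumptions, not the authors' actual implementation.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Cross-entropy on ground-truth action labels (hypothetical weighting).
    ce = F.cross_entropy(student_logits, labels)
    # KL divergence between temperature-softened student and teacher predictions.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * ce + (1.0 - alpha) * kd

# Training step (sketch): the teacher sees RGB frames, optical flow, and object
# detections; the student sees only RGB frames. At inference, only the student
# is kept, so no flow or detection modules are needed for deployment.
# with torch.no_grad():
#     teacher_logits = teacher(rgb, flow, detections)
# student_logits = student(rgb)
# loss = distillation_loss(student_logits, teacher_logits, labels)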