Paper title
Deep set conditioned latent representations for action recognition
Paper authors
Paper abstract
In recent years, multi-label, multi-class video action recognition has gained significant popularity. While reasoning over temporally connected atomic actions is mundane for intelligent species, standard artificial neural networks (ANNs) still struggle to classify them. In the real world, atomic actions often connect temporally to form more complex composite actions. The challenge lies in recognising composite actions of varying durations while other distinct composite or atomic actions occur in the background. Drawing upon the success of relational networks, we propose methods that learn to reason over the semantic concepts of objects and actions. We empirically show how ANNs benefit from pretraining, relational inductive biases and unordered set-based latent representations. In this paper we propose deep set conditioned I3D (SCI3D), a two-stream relational network that employs a latent representation of state and a visual representation for reasoning over events and actions. The network learns to reason about temporally connected actions in order to identify all of them in the video. The proposed method achieves an improvement of around 1.49% mAP in atomic action recognition and 17.57% mAP in composite action recognition over an I3D-NL baseline on the CATER dataset.
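To make the phrase "unordered set-based latent representations" concrete, the following is a minimal sketch of a generic sum-pooling set encoder. It is not the SCI3D architecture described in the paper. The function names `phi` and `encode_set`, the layer sizes, and the random "object states" are all hypothetical and chosen only for illustration. The point it shows is that pooling per-element embeddings by summation gives a representation that does not depend on the order in which the elements are presented.

```python
import math
import random


def phi(x, w):
    """Per-element embedding: one tanh layer applied to a single object state.

    Hypothetical stand-in for a learned embedding; `w` is a list of weight rows.
    """
    return [math.tanh(sum(wi * xi for wi, xi in zip(row, x))) for row in w]


def encode_set(elements, w):
    """Sum-pool the per-element embeddings so that element order cannot matter."""
    pooled = [0.0] * len(w)
    for x in elements:
        for j, v in enumerate(phi(x, w)):
            pooled[j] += v
    return pooled


rng = random.Random(0)
# Assumed sizes: 3 input features per object, 4 hidden units.
w = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
# Five made-up object states standing in for a scene description.
objects = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(5)]

shuffled = objects[:]
rng.shuffle(shuffled)

a = encode_set(objects, w)
b = encode_set(shuffled, w)

# The two encodings agree up to floating-point rounding.
same = all(math.isclose(x, y, abs_tol=1e-9) for x, y in zip(a, b))
print(same)
```

In a learned model the weights would be trained and the pooled vector would typically pass through a further network before being combined with visual features. How SCI3D does this is not specified in the abstract.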