Paper Title
An Information-rich Sampling Technique over Spatio-Temporal CNN for Classification of Human Actions in Videos
Paper Authors
Paper Abstract
We propose a novel scheme for human action recognition in videos, using a 3-dimensional Convolutional Neural Network (3D CNN) based classifier. Traditionally, in deep learning based human activity recognition approaches, either a few random frames or every $k^{th}$ frame of the video is considered for training the 3D CNN, where $k$ is a small positive integer, such as 4, 5, or 6. This kind of sampling reduces the volume of the input data, which speeds up training of the network and also avoids over-fitting to some extent, thus enhancing the performance of the 3D CNN model. In the proposed video sampling technique, $k$ consecutive frames of a video are aggregated into a single frame by computing a Gaussian-weighted sum of the $k$ frames. The resulting (aggregated) frame preserves the information better than the conventional approaches and is experimentally shown to perform better. In this paper, a 3D CNN architecture is proposed to extract spatio-temporal features, followed by a Long Short-Term Memory (LSTM) network to recognize human actions. The proposed 3D CNN architecture is capable of handling videos where the camera is placed at a distance from the performer. Experiments are performed on the KTH and Weizmann human action datasets, and the proposed scheme is shown to produce results comparable to state-of-the-art techniques.
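As a concrete illustration of the aggregation step, the sketch below builds each aggregated frame as a Gaussian-weighted sum of $k$ consecutive frames. This is a minimal NumPy sketch under stated assumptions: the frame layout `(k, H, W[, C])`, the value of `sigma`, and the centring and normalisation of the weights are illustrative choices, since the abstract does not fix these parameters.

```python
import numpy as np

def aggregate_frames(frames, sigma=1.0):
    """Aggregate k consecutive frames into one via a Gaussian-weighted sum.

    frames: array of shape (k, H, W) or (k, H, W, C) -- assumed layout.
    The weights form a discrete Gaussian centred on the middle frame,
    normalised to sum to 1 (an assumption; the paper's exact weighting
    is not specified in the abstract).
    """
    k = frames.shape[0]
    centre = (k - 1) / 2.0
    idx = np.arange(k)
    weights = np.exp(-((idx - centre) ** 2) / (2.0 * sigma ** 2))
    weights /= weights.sum()
    # Contract the temporal axis: weighted sum over the k frames.
    return np.tensordot(weights, frames.astype(np.float64), axes=(0, 0))

def sample_video(video, k=5, sigma=1.0):
    """Turn a video of shape (T, H, W[, C]) into T//k aggregated frames."""
    return np.stack([
        aggregate_frames(video[i:i + k], sigma)
        for i in range(0, len(video) - k + 1, k)
    ])
```

Compared with keeping only every $k^{th}$ frame, each output frame here blends all $k$ inputs, so motion occurring between the sampled instants still contributes to the retained frame.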
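The classifier pairs a 3D CNN feature extractor with an LSTM over the resulting temporal sequence. The following PyTorch sketch shows one plausible arrangement of that pipeline; all layer widths, kernel shapes, the hidden dimension, and the 6-class output (matching the six KTH actions) are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CNN3DLSTM(nn.Module):
    """Sketch: 3D-CNN spatio-temporal features followed by an LSTM classifier."""

    def __init__(self, num_classes=6, hidden=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),   # pool spatially, keep temporal length
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.AdaptiveAvgPool3d((None, 4, 4)),  # fixed 4x4 spatial map per step
        )
        self.lstm = nn.LSTM(32 * 4 * 4, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, x):                         # x: (batch, 1, T, H, W)
        f = self.features(x)                      # (batch, 32, T, 4, 4)
        f = f.permute(0, 2, 1, 3, 4).flatten(2)   # (batch, T, 32*4*4)
        out, _ = self.lstm(f)                     # run LSTM over time steps
        return self.classifier(out[:, -1])        # classify from last step
```

The spatial pooling before the LSTM is one way to make the network insensitive to how small the performer appears in the frame, which is in the spirit of the claim about handling distant-camera videos, though the actual mechanism used in the paper may differ.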