Paper title
Integrating Human Gaze into Attention for Egocentric Activity Recognition
Paper authors
Abstract
It is well known that human gaze carries significant information about visual attention. However, there are three main difficulties in incorporating the gaze data in an attention mechanism of deep neural networks: 1) the gaze fixation points are likely to have measurement errors due to blinking and rapid eye movements; 2) it is unclear when and how much the gaze data is correlated with visual attention; and 3) gaze data is not always available in many real-world situations. In this work, we introduce an effective probabilistic approach to integrate human gaze into spatiotemporal attention for egocentric activity recognition. Specifically, we represent the locations of gaze fixation points as structured discrete latent variables to model their uncertainties. In addition, we model the distribution of gaze fixations using a variational method. The gaze distribution is learned during the training process so that the ground-truth annotations of gaze locations are no longer needed in testing situations since they are predicted from the learned gaze distribution. The predicted gaze locations are used to provide informative attentional cues to improve the recognition performance. Our method outperforms all the previous state-of-the-art approaches on EGTEA, which is a large-scale dataset for egocentric activity recognition provided with gaze measurements. We also perform an ablation study and qualitative analysis to demonstrate that our attention mechanism is effective.
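To make the idea of a discrete latent gaze location more concrete, the following is a minimal sketch and not the authors' implementation. It assumes the gaze location is a categorical variable over an H x W grid of cells, that a relaxed sample is drawn with the Gumbel-softmax trick (the abstract does not say which estimator is used), and that the sampled map serves as spatial attention weights for pooling a feature map. The function names `sample_gaze_map`, `attend`, and `kl_categorical`, the grid size, and the temperature `tau` are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()


def sample_gaze_map(logits, tau=1.0, rng=None):
    """Draw a relaxed one-hot gaze map from an (H, W) grid of logits.

    Assumption: Gumbel-softmax relaxation of a categorical variable;
    lower tau gives samples closer to a hard one-hot location.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = logits.shape
    # Clip uniform noise away from 0 and 1 so both logs are finite.
    u = np.clip(rng.uniform(size=h * w), 1e-9, 1.0 - 1e-9)
    gumbel = -np.log(-np.log(u))
    y = softmax((logits.reshape(-1) + gumbel) / tau)
    return y.reshape(h, w)


def attend(features, gaze_map):
    """Pool a (C, H, W) feature map with an (H, W) map that sums to 1."""
    return (features * gaze_map[None, :, :]).sum(axis=(1, 2))


def kl_categorical(q, p, eps=1e-9):
    """KL(q || p) between two categorical distributions over grid cells.

    Assumption: a variational objective would typically include a term of
    this kind between an approximate posterior q and a prior p.
    """
    q = q.reshape(-1)
    p = p.reshape(-1)
    return float(np.sum(q * (np.log(q + eps) - np.log(p + eps))))


# Illustrative usage with random data (no real video features or gaze).
rng = np.random.default_rng(0)
features = rng.normal(size=(8, 7, 7))      # hypothetical C=8, H=W=7
prior_logits = rng.normal(size=(7, 7))     # hypothetical predicted gaze logits
gaze_map = sample_gaze_map(prior_logits, tau=0.5, rng=rng)
pooled = attend(features, gaze_map)
```

In this sketch the gaze map is sampled from predicted logits alone, which mirrors the test-time setting described in the abstract, where no measured gaze is available. During training, a measured gaze position could inform the posterior `q` while the KL term keeps the predicted distribution close to it; that pairing is an assumption here, not a detail stated in the abstract.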