Paper Title
EgoTaskQA: Understanding Human Tasks in Egocentric Videos
Paper Authors
Paper Abstract
Understanding human tasks through video observations is an essential capability of intelligent agents. The challenge of this capability lies in the difficulty of generating a detailed understanding of situated actions, their effects on object states (i.e., state changes), and their causal dependencies. These challenges are further aggravated by the natural parallelism of multi-tasking and by partial observations in multi-agent collaboration. Most prior works leverage action localization or future prediction as an indirect metric for evaluating such task understanding from videos. To make a direct evaluation, we introduce the EgoTaskQA benchmark, which provides a single home for the crucial dimensions of task understanding through question answering on real-world egocentric videos. We meticulously design questions that target the understanding of (1) action dependencies and effects, (2) intents and goals, and (3) agents' beliefs about others. These questions are divided into four types, including descriptive (what status?), predictive (what will?), explanatory (what caused?), and counterfactual (what if?), to provide diagnostic analyses of spatial, temporal, and causal understanding of goal-oriented tasks. We evaluate state-of-the-art video reasoning models on our benchmark and show the significant gap between them and humans in understanding complex goal-oriented egocentric videos. We hope this effort will drive the vision community to move forward with goal-oriented video understanding and reasoning.