Paper Title

AOE-Net: Entities Interactions Modeling with Adaptive Attention Mechanism for Temporal Action Proposals Generation

Authors

Khoa Vo, Sang Truong, Kashu Yamazaki, Bhiksha Raj, Minh-Triet Tran, Ngan Le

Abstract

Temporal action proposal generation (TAPG) is a challenging task, which requires localizing action intervals in an untrimmed video. Intuitively, we, as humans, perceive an action through the interactions between actors, relevant objects, and the surrounding environment. Despite the significant progress of TAPG, a vast majority of existing methods ignore the aforementioned principle of the human perceiving process by applying a backbone network to a given video as a black box. In this paper, we propose to model these interactions with a multi-modal representation network, namely, Actors-Objects-Environment Interaction Network (AOE-Net). Our AOE-Net consists of two modules, i.e., perception-based multi-modal representation (PMR) and boundary-matching module (BMM). Additionally, we introduce an adaptive attention mechanism (AAM) in PMR to focus only on main actors (or relevant objects) and model the relationships among them. The PMR module represents each video snippet by a visual-linguistic feature, in which main actors and the surrounding environment are represented by visual information, whereas relevant objects are depicted by linguistic features through an image-text model. The BMM module processes the sequence of visual-linguistic features as its input and generates action proposals. Comprehensive experiments and extensive ablation studies on the ActivityNet-1.3 and THUMOS-14 datasets show that our proposed AOE-Net outperforms previous state-of-the-art methods with remarkable performance and generalization for both TAPG and temporal action detection. To prove the robustness and effectiveness of AOE-Net, we further conduct an ablation study on egocentric videos, i.e., the EPIC-KITCHENS 100 dataset. Source code is available upon acceptance.
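
To make the described dataflow concrete, below is a minimal, illustrative PyTorch sketch of the pipeline the abstract outlines: per-snippet actor, object (linguistic), and environment features are pooled with an attention mechanism, fused into one visual-linguistic feature per snippet, and a temporal head scores candidate boundaries. All class names, feature dimensions, and the simplified boundary head are assumptions for illustration only, not the authors' PMR/AAM/BMM implementation.

```python
# Hypothetical sketch of an AOE-Net-style pipeline; not the official implementation.
import torch
import torch.nn as nn


class AdaptiveAttentionPool(nn.Module):
    """Attend over a variable number of entity features (e.g., detected actors
    or relevant objects) and pool them into one vector per snippet."""

    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.query = nn.Parameter(torch.randn(1, 1, dim))

    def forward(self, entities: torch.Tensor) -> torch.Tensor:
        # entities: (batch, num_entities, dim) -> (batch, dim)
        q = self.query.expand(entities.size(0), -1, -1)
        pooled, _ = self.attn(q, entities, entities)
        return pooled.squeeze(1)


class SnippetRepresentation(nn.Module):
    """Fuse actor, object (linguistic), and environment features into one
    visual-linguistic feature per snippet (a stand-in for the PMR module)."""

    def __init__(self, dim: int):
        super().__init__()
        self.actor_pool = AdaptiveAttentionPool(dim)
        self.object_pool = AdaptiveAttentionPool(dim)
        self.fuse = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU())

    def forward(self, actors, objects, environment):
        # actors/objects: (batch, num_entities, dim); environment: (batch, dim)
        fused = torch.cat(
            [self.actor_pool(actors), self.object_pool(objects), environment], dim=-1
        )
        return self.fuse(fused)


class BoundaryHead(nn.Module):
    """Simplified proposal head: per-snippet start/end probabilities from the
    fused feature sequence (the paper's BMM is considerably more elaborate)."""

    def __init__(self, dim: int):
        super().__init__()
        self.conv = nn.Conv1d(dim, 2, kernel_size=3, padding=1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, time, dim) -> (batch, time, 2) start/end scores
        return torch.sigmoid(self.conv(features.transpose(1, 2))).transpose(1, 2)


if __name__ == "__main__":
    dim, T = 256, 100
    pmr, head = SnippetRepresentation(dim), BoundaryHead(dim)
    actors = torch.randn(1, T, 5, dim)   # assumed: up to 5 actor boxes per snippet
    objects = torch.randn(1, T, 8, dim)  # assumed: linguistic features of 8 objects
    env = torch.randn(1, T, dim)         # assumed: one global environment feature
    snippets = torch.stack(
        [pmr(actors[:, t], objects[:, t], env[:, t]) for t in range(T)], dim=1
    )
    print(head(snippets).shape)  # torch.Size([1, 100, 2])
```

In this sketch the attention pooling plays the role of the adaptive attention mechanism (selecting which entities matter per snippet), while the start/end scoring head merely stands in for the boundary-matching module described in the abstract.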
