An Attention-guided Multistream Feature Fusion Network for Localization of Risky Objects in Driving Videos

Paper Title

An Attention-guided Multistream Feature Fusion Network for Localization of Risky Objects in Driving Videos

Authors

Muhammad Monjurul Karim, Ruwen Qin, Zhaozheng Yin

Abstract

Detecting dangerous traffic agents in videos captured by vehicle-mounted dashboard cameras (dashcams) is essential to facilitate safe navigation in a complex environment. Accident-related videos are just a minor portion of the driving video big data, and the transient pre-accident processes are highly dynamic and complex. Besides, risky and non-risky traffic agents can be similar in their appearance. These make risky object localization in the driving video particularly challenging. To this end, this paper proposes an attention-guided multistream feature fusion network (AM-Net) to localize dangerous traffic agents from dashcam videos. Two Gated Recurrent Unit (GRU) networks use object bounding box and optical flow features extracted from consecutive video frames to capture spatio-temporal cues for distinguishing dangerous traffic agents. An attention module coupled with the GRUs learns to attend to the traffic agents relevant to an accident. Fusing the two streams of features, AM-Net predicts the riskiness scores of traffic agents in the video. In supporting this study, the paper also introduces a benchmark dataset called Risky Object Localization (ROL). The dataset contains spatial, temporal, and categorical annotations with the accident, object, and scene-level attributes. The proposed AM-Net achieves a promising performance of 85.73% AUC on the ROL dataset. Meanwhile, the AM-Net outperforms current state-of-the-art for video anomaly detection by 6.3% AUC on the DoTA dataset. A thorough ablation study further reveals AM-Net's merits by evaluating the contributions of its different components.
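The pipeline summarized in the abstract (two GRU streams over bounding-box and optical-flow features, attention over traffic agents, fusion, then a per-agent riskiness score) might be sketched roughly as follows. This is only an illustrative NumPy forward pass with random, untrained weights, not the authors' implementation: the hidden size, the concatenation-based fusion, the placement of the attention after fusion, and every name here (`GRUCell`, `agent_attention`, `risk_scores`) are assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell with update and reset gates (biases omitted)."""
    def __init__(self, in_dim, hid_dim, rng):
        shape = (hid_dim, in_dim + hid_dim)
        self.Wz = rng.normal(0.0, 0.1, shape)  # update gate weights
        self.Wr = rng.normal(0.0, 0.1, shape)  # reset gate weights
        self.Wh = rng.normal(0.0, 0.1, shape)  # candidate state weights

    def step(self, x, h):
        # x: (N, in_dim) features of N agents, h: (N, hid_dim) previous state
        xh = np.concatenate([x, h], axis=1)
        z = sigmoid(xh @ self.Wz.T)
        r = sigmoid(xh @ self.Wr.T)
        h_cand = np.tanh(np.concatenate([x, r * h], axis=1) @ self.Wh.T)
        return (1.0 - z) * h + z * h_cand

def agent_attention(features, w):
    """Softmax attention weights over the N agents in one frame."""
    scores = features @ w
    scores = scores - scores.max()  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

def risk_scores(boxes, flows, hid_dim=8, seed=0):
    """boxes: (T, N, 4), flows: (T, N, F) -> (T, N) scores in (0, 1)."""
    rng = np.random.default_rng(seed)
    T, N, _ = boxes.shape
    gru_box = GRUCell(boxes.shape[2], hid_dim, rng)   # bounding-box stream
    gru_flow = GRUCell(flows.shape[2], hid_dim, rng)  # optical-flow stream
    w_att = rng.normal(0.0, 0.1, 2 * hid_dim)  # assumed attention vector
    w_out = rng.normal(0.0, 0.1, 2 * hid_dim)  # assumed scoring vector
    h_box = np.zeros((N, hid_dim))
    h_flow = np.zeros((N, hid_dim))
    out = np.zeros((T, N))
    for t in range(T):
        h_box = gru_box.step(boxes[t], h_box)
        h_flow = gru_flow.step(flows[t], h_flow)
        fused = np.concatenate([h_box, h_flow], axis=1)  # fuse both streams
        alpha = agent_attention(fused, w_att)            # (N,), sums to 1
        attended = fused * (N * alpha)[:, None]          # reweight each agent
        out[t] = sigmoid(attended @ w_out)               # riskiness per agent
    return out

# Toy input: 5 frames, 3 tracked agents, 4 box coordinates, 6 flow features.
boxes = np.random.default_rng(1).random((5, 3, 4))
flows = np.random.default_rng(2).random((5, 3, 6))
print(risk_scores(boxes, flows).shape)
```

In a trained model the weights would be learned from annotated accident videos such as those in ROL; with random weights the scores carry no meaning, and the sketch only shows how the two streams, the attention weights, and the fused score fit together.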
