Paper title
Improved Soccer Action Spotting using both Audio and Video Streams
Paper authors
Paper abstract
In this paper, we present a study of multi-modal (audio and video) action spotting and classification in soccer videos. Action spotting and classification consist of finding the temporal anchors of events in a video and determining which event each one is. This is an important application of general activity understanding. Here, we report an experimental study on combining audio and video information at different stages of deep neural network architectures. We used the SoccerNet benchmark dataset, which contains annotated events for 500 soccer game videos from the Big Five European leagues. Through this work, we evaluated several ways to integrate the audio stream into video-only architectures. We observed an average absolute improvement in mean Average Precision (mAP) of $7.43\%$ for the action classification task and of $4.19\%$ for the action spotting task.
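To make "combining audio and video information at different stages" concrete, the following is a minimal sketch of two common fusion points: early fusion (concatenating features before a shared classifier) and late fusion (averaging per-modality class probabilities). It is not the architecture evaluated in the paper; the function names, feature sizes, and weights are all hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(features, weights, bias):
    """Linear classifier head: one weight row and one bias per class."""
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]


def early_fusion(video_feat, audio_feat, weights, bias):
    """Concatenate both feature vectors, then classify them jointly."""
    fused = list(video_feat) + list(audio_feat)
    return softmax(linear(fused, weights, bias))


def late_fusion(video_feat, audio_feat, video_head, audio_head):
    """Classify each modality separately, then average the probabilities."""
    p_video = softmax(linear(video_feat, *video_head))
    p_audio = softmax(linear(audio_feat, *audio_head))
    return [(pv + pa) / 2 for pv, pa in zip(p_video, p_audio)]


# Toy inputs (assumed values): 2-dim video feature, 1-dim audio feature, 2 classes.
video_feat = [0.2, 0.9]
audio_feat = [0.7]

p_early = early_fusion(
    video_feat, audio_feat,
    weights=[[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]],
    bias=[0.0, 0.0],
)
p_late = late_fusion(
    video_feat, audio_feat,
    video_head=([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    audio_head=([[1.0], [-1.0]], [0.0, 0.0]),
)
print(p_early, p_late)
```

In a real system the linear heads would be replaced by trained network layers, and the fusion point becomes a design choice: early fusion lets the classifier learn cross-modal interactions, while late fusion keeps the two streams independent until the final decision.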