A Closer Look at Temporal Ordering in the Segmentation of Instructional Videos

Paper title

A Closer Look at Temporal Ordering in the Segmentation of Instructional Videos

Authors

Anil Batra, Shreyank N Gowda, Frank Keller, Laura Sevilla-Lara

Abstract

Understanding the steps required to perform a task is an important skill for AI systems. Learning these steps from instructional videos involves two subproblems: (i) identifying the temporal boundary of sequentially occurring segments and (ii) summarizing these steps in natural language. We refer to this task as Procedure Segmentation and Summarization (PSS). In this paper, we take a closer look at PSS and propose three fundamental improvements over current methods. The segmentation task is critical, as generating a correct summary requires each step of the procedure to be correctly identified. However, current segmentation metrics often overestimate the segmentation quality because they do not consider the temporal order of segments. In our first contribution, we propose a new segmentation metric that takes into account the order of segments, giving a more reliable measure of the accuracy of a given predicted segmentation. Current PSS methods are typically trained by proposing segments, matching them with the ground truth and computing a loss. However, much like segmentation metrics, existing matching algorithms do not consider the temporal order of the mapping between candidate segments and the ground truth. In our second contribution, we propose a matching algorithm that constrains the temporal order of segment mapping, and is also differentiable. Lastly, we introduce multi-modal feature training for PSS, which further improves segmentation. We evaluate our approach on two instructional video datasets (YouCook2 and Tasty) and observe an improvement over the state-of-the-art of $\sim7\%$ and $\sim2.5\%$ for procedure segmentation and summarization, respectively.
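The claim that order-agnostic metrics can overestimate segmentation quality can be illustrated with a small sketch. The code below is not the metric or the matching algorithm proposed in the paper; it is a minimal illustration under stated assumptions: segments are (start, end) pairs in seconds, predictions are listed in the order the model emits them, overlap is measured with temporal IoU, and the order-preserving score uses a standard dynamic program over monotone one-to-one matchings. All function names are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds; assumed representation


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def order_agnostic_score(preds: List[Segment], gts: List[Segment]) -> float:
    """Each ground-truth segment takes its best-overlapping prediction,
    ignoring the position of that prediction in the predicted sequence."""
    return sum(max(temporal_iou(p, g) for p in preds) for g in gts) / len(gts)


def order_preserving_score(preds: List[Segment], gts: List[Segment]) -> float:
    """Best total IoU over one-to-one matchings that keep sequence order
    (if prediction i matches ground truth j, later predictions may only
    match later ground-truth segments), normalised by the number of
    ground-truth segments."""
    n, m = len(preds), len(gts)
    # best[i][j]: best total IoU using the first i predictions and first j ground truths
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j],  # leave prediction i unmatched
                best[i][j - 1],  # leave ground truth j unmatched
                best[i - 1][j - 1] + temporal_iou(preds[i - 1], gts[j - 1]),
            )
    return best[n][m] / m


# Hypothetical example: the first two steps are predicted in swapped order.
gts = [(0.0, 10.0), (10.0, 20.0), (20.0, 30.0)]
preds = [(10.0, 20.0), (0.0, 10.0), (20.0, 30.0)]

agnostic = order_agnostic_score(preds, gts)
ordered = order_preserving_score(preds, gts)
print(agnostic, ordered)
```

In this toy case every ground-truth segment has a perfectly overlapping prediction, so the order-agnostic score is 1.0, while the order-preserving score drops to 2/3 because at most two of the three predictions can be matched without violating the temporal order.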
