Paper Title
Realistic Video Summarization through VISIOCITY: A New Benchmark and Evaluation Framework
Paper Authors
Paper Abstract
Automatic video summarization remains an unsolved problem due to several challenges, and we take steps towards making it more realistic by addressing them. First, currently available datasets either have very short videos or have only a few long videos of a particular type. We introduce VISIOCITY, a new benchmark dataset comprising longer videos across six different categories with dense concept annotations; it supports different flavors of video summarization and can be used for other vision problems as well. Second, human reference summaries are difficult to obtain for long videos. We present a novel recipe based on Pareto optimality to automatically generate multiple reference summaries from the indirect ground truth present in VISIOCITY, and we show that these summaries are on par with human summaries. Third, we demonstrate that in the presence of multiple ground-truth summaries (a consequence of the task's highly subjective nature), learning from a single combined ground-truth summary using a single loss function is not a good idea. We propose a simple recipe, VISIOCITY-SUM, to enhance an existing model using a combination of losses, and demonstrate that it beats the current state-of-the-art techniques when tested on VISIOCITY. We also show that evaluating a summary with a single measure, as is the current typical practice, falls short. We propose a framework for quantitative assessment of summary quality that comes closer to human judgment than any single measure, say F1. Finally, we report the performance of a few representative video summarization techniques on VISIOCITY assessed using various measures, bring out the limitations of the techniques and/or the assessment mechanisms in modeling human judgment, and demonstrate the effectiveness of our evaluation framework in doing so.
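The paper's recipe selects reference summaries via Pareto optimality. As an illustrative sketch (not the authors' actual procedure), the core idea of keeping only non-dominated candidates can be expressed as follows; the candidate names and the two scoring criteria (diversity, importance) are hypothetical, with higher scores assumed better on every criterion:

```python
def pareto_front(candidates):
    """Return candidates not dominated by any other candidate.

    candidates: list of (name, scores) pairs, where scores is a tuple
    of per-criterion values and higher is better on every criterion.
    """
    def dominates(a, b):
        # a dominates b if a is at least as good on all criteria
        # and strictly better on at least one
        return (all(x >= y for x, y in zip(a, b))
                and any(x > y for x, y in zip(a, b)))

    return [
        (name, s) for name, s in candidates
        if not any(dominates(other, s)
                   for _, other in candidates if other != s)
    ]

# Hypothetical candidate summaries scored on (diversity, importance)
cands = [("A", (0.9, 0.4)), ("B", (0.5, 0.5)),
         ("C", (0.8, 0.8)), ("D", (0.6, 0.7))]
front = pareto_front(cands)  # C dominates B and D; A and C remain
```

Each surviving candidate represents a distinct, defensible trade-off between the criteria, which is why a Pareto-based recipe naturally yields multiple reference summaries rather than a single one.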