OS-MSL: One Stage Multimodal Sequential Link Framework for Scene Segmentation and Classification

Paper Title

OS-MSL: One Stage Multimodal Sequential Link Framework for Scene Segmentation and Classification

Paper Authors

Ye Liu, Lingfeng Qiao, Di Yin, Zhuoxuan Jiang, Xinghua Jiang, Deqiang Jiang, Bo Ren

Paper Abstract

Scene segmentation and classification (SSC) is a critical step in video structuring analysis. Intuitively, learning the two tasks jointly should let each benefit the other through shared information. However, scene segmentation is concerned mainly with local differences between adjacent shots, whereas classification needs a global representation of each scene segment, so the model is likely to be dominated by one of the two tasks during training. In this paper, we address this challenge from a different perspective: we unite the two tasks into one by predicting shot links, where a link connects two adjacent shots and indicates that they belong to the same scene or category. To this end, we propose a general One Stage Multimodal Sequential Link Framework (OS-MSL) that both distinguishes and leverages the two-fold semantics by reformulating the two learning tasks as a unified one. Furthermore, we design a dedicated module, DiffCorrNet, to explicitly extract difference and correlation information among shots. We conduct extensive experiments on a brand-new large-scale dataset collected from real-world applications and on MovieScenes. The results on both datasets demonstrate the effectiveness of the proposed method against strong baselines.
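
To illustrate the link formulation described in the abstract, the following is a minimal sketch of how per-pair link labels could be decoded back into scenes with categories. The label encoding (None for "no link", a category string for "linked"), the function name decode_links, and the NO_LINK marker are assumptions made for this example; the abstract does not specify how OS-MSL encodes or decodes links.

```python
from typing import List, Optional, Tuple

# Hypothetical marker: the two adjacent shots are NOT linked (scene boundary).
NO_LINK = None


def decode_links(links: List[Optional[str]]) -> List[Tuple[int, int, Optional[str]]]:
    """Decode link labels into (first_shot, last_shot, category) scenes.

    links[i] describes the pair (shot i, shot i + 1). NO_LINK means a scene
    boundary lies between the two shots; any other value is taken as the
    category shared by both shots (an assumed encoding, for illustration).
    A scene made of a single shot has no link, so its category stays None.
    """
    num_shots = len(links) + 1
    scenes: List[Tuple[int, int, Optional[str]]] = []
    start = 0
    category: Optional[str] = None
    for i, link in enumerate(links):
        if link is NO_LINK:
            # Close the current scene at shot i and start a new one at i + 1.
            scenes.append((start, i, category))
            start = i + 1
            category = None
        elif category is None:
            # The first link seen inside a scene sets that scene's category.
            category = link
    # Close the final scene, which always ends at the last shot.
    scenes.append((start, num_shots - 1, category))
    return scenes


if __name__ == "__main__":
    # Five shots: shots 0-2 linked as "sports", a boundary, shots 3-4 as "news".
    print(decode_links(["sports", "sports", NO_LINK, "news"]))
```

In this sketch a single label sequence carries both the segmentation (where links are absent) and the classification (which category a link carries), which is the sense in which the two tasks are merged into one prediction problem.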
