Paper Title
VIOLIN: A Large-Scale Dataset for Video-and-Language Inference
Paper Authors
Paper Abstract
We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text. Given a video clip with aligned subtitles as premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the given video clip. A new large-scale dataset, named Violin (VIdeO-and-Language INference), is introduced for this task, which consists of 95,322 video-hypothesis pairs from 15,887 video clips, spanning over 582 hours of video. These video clips contain rich content with diverse temporal dynamics, event shifts, and people interactions, collected from two sources: (i) popular TV shows, and (ii) movie clips from YouTube channels. In order to address our new multimodal inference task, a model is required to possess sophisticated reasoning skills, from surface-level grounding (e.g., identifying objects and characters in the video) to in-depth commonsense reasoning (e.g., inferring causal relations of events in the video). We present a detailed analysis of the dataset and an extensive evaluation over many strong baselines, providing valuable insights on the challenges of this new task.