Paper Title
Learning Video Representations from Large Language Models
Paper Authors
Paper Abstract
We introduce LaViLa, a new approach to learning video-language representations by leveraging Large Language Models (LLMs). We repurpose pre-trained LLMs to be conditioned on visual input, and finetune them to create automatic video narrators. Our auto-generated narrations offer a number of advantages, including dense coverage of long videos, better temporal synchronization of the visual information and text, and much higher diversity of text. The video-text embedding learned contrastively with these additional auto-generated narrations outperforms the previous state-of-the-art on multiple first-person and third-person video tasks, in both zero-shot and finetuned setups. Most notably, LaViLa obtains absolute gains of 10.1% on the EGTEA classification benchmark and 5.9% on the Epic-Kitchens-100 multi-instance retrieval benchmark. Furthermore, LaViLa trained with only half the narrations from the Ego4D dataset outperforms baseline models trained on the full set, and shows positive scaling behavior when increasing pre-training data and model size.
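To make the abstract's "learned contrastively" step concrete, below is a minimal sketch of a CLIP-style InfoNCE objective over paired video clips and auto-generated narrations. The function name, temperature value, and batch shapes are illustrative assumptions, not LaViLa's actual implementation.

```python
# Sketch of a symmetric contrastive (InfoNCE) loss for video-text pairs.
# Assumes embeddings come from some video encoder and text encoder;
# pairs at the same batch index are positives, all others are negatives.
import torch
import torch.nn.functional as F


def video_text_contrastive_loss(video_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE loss over a batch of (video, narration) embedding pairs."""
    # L2-normalize so the dot product becomes cosine similarity.
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = video_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: video-to-text and text-to-video directions.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2t + loss_t2v)


if __name__ == "__main__":
    # Toy example: 8 clips paired with 8 auto-generated narrations, dim 512.
    clips = torch.randn(8, 512)
    narrations = torch.randn(8, 512)
    print(video_text_contrastive_loss(clips, narrations))
```

In this setup, the auto-generated narrations simply enlarge and densify the set of positive text pairs for each clip; the loss itself is the standard symmetric formulation used in contrastive video-text pre-training.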