Paper Title
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
Paper Authors
Paper Abstract
Video question answering (VideoQA) is a complex task that requires diverse multi-modal data for training. Manual annotation of questions and answers for videos, however, is tedious and prohibits scalability. To tackle this problem, recent methods consider zero-shot settings with no manual annotation of visual question-answer pairs. In particular, a promising approach adapts frozen autoregressive language models pretrained on Web-scale text-only data to multi-modal inputs. In contrast, we here build on frozen bidirectional language models (BiLM) and show that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA. In particular, (i) we combine visual inputs with the frozen BiLM using light trainable modules, (ii) we train such modules using Web-scraped multi-modal data, and finally (iii) we perform zero-shot VideoQA inference through masked language modeling, where the masked text is the answer to a given question. Our proposed approach, FrozenBiLM, outperforms the state of the art in zero-shot VideoQA by a significant margin on a variety of datasets, including LSMDC-FiB, iVQA, MSRVTT-QA, MSVD-QA, ActivityNet-QA, TGIF-FrameQA, How2QA and TVQA. It also demonstrates competitive performance in the few-shot and fully-supervised settings. Our code and models are publicly available at https://github.com/antoyang/FrozenBiLM.
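To make step (iii) of the abstract concrete, below is a minimal, text-only sketch of answering by masked language modeling with a frozen bidirectional model. It uses `bert-base-uncased` from Hugging Face Transformers purely as a stand-in backbone, a hypothetical prompt template, and a hypothetical candidate-answer list; the actual FrozenBiLM model additionally feeds video features into the frozen BiLM through lightweight trained adapter modules, which this sketch omits.

```python
# Minimal sketch (assumptions: stand-in backbone, single-token answers,
# no visual inputs) of zero-shot answer prediction via masked language modeling.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "bert-base-uncased"  # stand-in; not the backbone used in the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()  # the bidirectional language model stays frozen

def score_answers(question: str, candidate_answers: list[str]) -> dict[str, float]:
    """Score single-token candidate answers by filling a [MASK] slot."""
    prompt = f"Question: {question} Answer: {tokenizer.mask_token}."
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Locate the masked position and read out its vocabulary distribution.
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    log_probs = logits[0, mask_pos[0]].log_softmax(dim=-1)
    scores = {}
    for ans in candidate_answers:
        ids = tokenizer(ans, add_special_tokens=False)["input_ids"]
        if len(ids) == 1:  # keep the sketch to single-token answers
            scores[ans] = log_probs[ids[0]].item()
    return scores

# Hypothetical usage: in the full system, video features conditioned via
# trained adapters would also inform the distribution at the masked position.
print(score_answers("What animal is jumping over the fence?", ["dog", "cat", "car"]))
```

The design choice illustrated here is that bidirectional masked prediction lets the frozen model score an answer in place, conditioned on context on both sides of the mask, rather than generating it autoregressively.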