Paper Title

Just Ask: Learning to Answer Questions from Millions of Narrated Videos

Paper Authors

Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

Paper Abstract

Recent methods for visual question answering rely on large-scale annotated datasets. Manual annotation of questions and answers for videos, however, is tedious, expensive and prevents scalability. In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering making use of automatic cross-modal supervision. We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations. Given narrated videos, we then automatically generate the HowToVQA69M dataset with 69M video-question-answer triplets. To handle the open vocabulary of diverse answers in this dataset, we propose a training procedure based on a contrastive loss between a video-question multi-modal transformer and an answer transformer. We introduce the zero-shot VideoQA task and show excellent results, in particular for rare answers. Furthermore, we demonstrate our method to significantly outperform the state of the art on MSRVTT-QA, MSVD-QA, ActivityNet-QA and How2QA. Finally, for a detailed evaluation we introduce iVQA, a new VideoQA dataset with reduced language biases and high-quality redundant manual annotations. Our code, datasets and trained models are available at https://antoyang.github.io/just-ask.html.
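The abstract mentions a training procedure based on a contrastive loss between a video-question multi-modal transformer and an answer transformer. The snippet below is a minimal sketch of one common way to implement such a contrastive objective with in-batch negatives, not the authors' exact formulation; the function name `contrastive_vqa_loss`, the tensor names, and the use of in-batch negatives are assumptions for illustration, and the video-question and answer encoders themselves are not shown.

```python
import torch
import torch.nn.functional as F

def contrastive_vqa_loss(vq_emb: torch.Tensor, ans_emb: torch.Tensor) -> torch.Tensor:
    """Hypothetical in-batch contrastive loss between video-question and answer embeddings.

    vq_emb:  (B, D) embeddings from a video-question multi-modal encoder.
    ans_emb: (B, D) embeddings from an answer encoder; row i is the correct
             answer for video-question pair i, the other rows act as negatives.
    """
    # Similarity matrix: entry (i, j) scores video-question i against answer j.
    scores = vq_emb @ ans_emb.t()                      # (B, B)
    targets = torch.arange(scores.size(0), device=scores.device)
    # Cross-entropy over in-batch negatives pulls matched pairs together
    # and pushes mismatched video-question / answer pairs apart.
    return F.cross_entropy(scores, targets)
```

Scoring each video-question embedding against every answer embedding in the batch avoids a fixed answer vocabulary, which is one way to handle the open vocabulary of diverse answers the abstract describes.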
