Paper Title


Describing Unseen Videos via Multi-Modal Cooperative Dialog Agents

Authors

Ye Zhu, Yu Wu, Yi Yang, Yan Yan

Abstract


With rising concerns about AI systems being provided with direct access to abundant sensitive information, researchers seek to develop more reliable AI with implicit information sources. To this end, in this paper, we introduce a new task called video description via two multi-modal cooperative dialog agents, whose ultimate goal is for one conversational agent to describe an unseen video based on the dialog and two static frames. Specifically, one of the intelligent agents - Q-BOT - is given two static frames from the beginning and the end of the video, as well as a finite number of opportunities to ask relevant natural language questions before describing the unseen video. A-BOT, the other agent, who has already seen the entire video, assists Q-BOT in accomplishing the goal by providing answers to those questions. We propose a QA-Cooperative Network with a dynamic dialog history update learning mechanism to transfer knowledge from A-BOT to Q-BOT, thus helping Q-BOT to better describe the video. Extensive experiments demonstrate that Q-BOT can effectively learn to describe an unseen video with the proposed model and cooperative learning method, achieving performance comparable to the case where Q-BOT is given the full ground-truth dialog history.
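The abstract describes a round-based protocol: Q-BOT sees only the first and last frames, asks a finite number of questions, updates its dialog history after each answer from A-BOT, and finally describes the video. The toy Python sketch below illustrates only that interaction loop; all class and method names are illustrative assumptions, and the rule-based agents stand in for the paper's learned neural networks.

```python
# Hypothetical sketch of the Q-BOT / A-BOT cooperative dialog loop.
# The agents here are rule-based stand-ins, not the paper's actual model.

class ABot:
    """Agent that has 'seen' the whole video and answers questions."""
    def __init__(self, video_facts):
        self.video_facts = video_facts  # toy stand-in for full video access

    def answer(self, question):
        # Toy lookup: answer with the fact the question asks about.
        return self.video_facts.get(question, "unknown")


class QBot:
    """Agent that sees only the first and last frames, then asks questions."""
    def __init__(self, first_frame, last_frame, question_budget):
        self.frames = (first_frame, last_frame)
        self.budget = question_budget       # finite number of questions
        self.dialog_history = []            # updated after each QA round

    def ask(self, round_idx):
        # Toy question policy: cycle through a fixed set of query slots.
        slots = ["action", "object", "location"]
        return slots[round_idx % len(slots)]

    def update_history(self, question, answer):
        # Dynamic dialog history update: fold each QA pair into the history.
        self.dialog_history.append((question, answer))

    def describe(self):
        # Compose a description from the static frames plus gathered answers.
        facts = dict(self.dialog_history)
        return (f"Video from '{self.frames[0]}' to '{self.frames[1]}': "
                f"{facts.get('action', 'something')} with "
                f"{facts.get('object', 'something')} in the "
                f"{facts.get('location', 'scene')}.")


def cooperative_dialog(q_bot, a_bot):
    """Run the finite QA rounds, then have Q-BOT describe the unseen video."""
    for t in range(q_bot.budget):
        q = q_bot.ask(t)
        a = a_bot.answer(q)
        q_bot.update_history(q, a)
    return q_bot.describe()
```

For example, with `ABot({"action": "chopping", "object": "vegetables", "location": "kitchen"})` and a three-question budget, Q-BOT gathers all three facts before describing the video. In the paper, these hand-written policies are replaced by learned modules, and knowledge transfer from A-BOT to Q-BOT happens through training rather than a dictionary lookup.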
