iPerceive: Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question Answering

Paper title

iPerceive: Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question Answering

Authors

Aman Chadha, Gurneet Arora, Navpreet Kaloty

Abstract

Most prior art in visual understanding relies solely on analyzing the "what" (e.g., event recognition) and "where" (e.g., event localization), which in some cases, fails to describe correct contextual relationships between events or leads to incorrect underlying visual attention. Part of what defines us as human and fundamentally different from machines is our instinct to seek causality behind any association, say an event Y that happened as a direct result of event X. To this end, we propose iPerceive, a framework capable of understanding the "why" between events in a video by building a common-sense knowledge base using contextual cues to infer causal relationships between objects in the video. We demonstrate the effectiveness of our technique using the dense video captioning (DVC) and video question answering (VideoQA) tasks. Furthermore, while most prior work in DVC and VideoQA relies solely on visual information, other modalities such as audio and speech are vital for a human observer's perception of an environment. We formulate DVC and VideoQA tasks as machine translation problems that utilize multiple modalities. By evaluating the performance of iPerceive DVC and iPerceive VideoQA on the ActivityNet Captions and TVQA datasets respectively, we show that our approach furthers the state-of-the-art. Code and samples are available at: iperceive.amanchadha.com.
