Paper Title


Sound2Sight: Generating Visual Dynamics from Sound and Context

Authors

Anoop Cherian, Moitreya Chatterjee, Narendra Ahuja

Abstract


Learning associations across modalities is critical for robust multimodal reasoning, especially when a modality may be missing during inference. In this paper, we study this problem in the context of audio-conditioned visual synthesis -- a task that is important, for example, in occlusion reasoning. Specifically, our goal is to generate future video frames and their motion dynamics conditioned on audio and a few past frames. To tackle this problem, we present Sound2Sight, a deep variational framework that is trained to learn a per-frame stochastic prior conditioned on a joint embedding of audio and past frames. This embedding is learned via a multi-head attention-based audio-visual transformer encoder. The learned prior is then sampled to further condition a video forecasting module to generate future frames. The stochastic prior allows the model to sample multiple plausible futures that are consistent with the provided audio and the past context. Moreover, to improve the quality and coherence of the generated frames, we propose a multimodal discriminator that differentiates between a synthesized and a real audio-visual clip. We empirically evaluate our approach, vis-à-vis closely-related prior methods, on two new datasets, viz. (i) Multimodal Stochastic Moving MNIST with a Surprise Obstacle and (ii) YouTube Paintings, as well as on the existing Audio-Set Drums dataset. Our extensive experiments demonstrate that Sound2Sight significantly outperforms the state of the art in the generated video quality, while also producing diverse video content.
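To make the two core ideas in the abstract concrete, below is a minimal NumPy sketch of (a) fusing audio and frame tokens with multi-head scaled dot-product attention, and (b) sampling a per-frame Gaussian stochastic prior from the pooled joint embedding via the reparameterization trick. All dimensions, weight matrices, and function names here are illustrative assumptions for exposition; this is not the authors' implementation, which uses learned projections and a full transformer encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

def multi_head_attention(q, k, v, num_heads):
    """Toy multi-head scaled dot-product attention (no learned projections)."""
    d = q.shape[-1]
    assert d % num_heads == 0
    dh = d // num_heads  # per-head dimension
    outs = []
    for h in range(num_heads):
        sl = slice(h * dh, (h + 1) * dh)
        scores = q[:, sl] @ k[:, sl].T / np.sqrt(dh)
        scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)            # softmax over keys
        outs.append(w @ v[:, sl])
    return np.concatenate(outs, axis=-1)

def stochastic_prior_sample(joint_emb, w_mu, w_logvar):
    """Gaussian prior over the latent z via the reparameterization trick."""
    mu = joint_emb @ w_mu
    logvar = joint_emb @ w_logvar
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps, mu, logvar

# Toy sizes: 5 past time steps, 8-dim tokens, 4-dim latent (all illustrative).
d, z_dim, T = 8, 4, 5
audio_tokens = rng.standard_normal((T, d))   # stand-in audio features
frame_tokens = rng.standard_normal((T, d))   # stand-in past-frame features
tokens = np.concatenate([frame_tokens, audio_tokens], axis=0)

# Self-attention over the joint audio-visual token sequence, then mean-pool.
fused = multi_head_attention(tokens, tokens, tokens, num_heads=2)
joint_emb = fused.mean(axis=0)

# Sample a latent that would condition the video forecasting module.
w_mu = rng.standard_normal((d, z_dim)) * 0.1
w_logvar = rng.standard_normal((d, z_dim)) * 0.1
z, mu, logvar = stochastic_prior_sample(joint_emb, w_mu, w_logvar)
print(z.shape)  # (4,)
```

Because `z` is sampled rather than fixed, drawing it repeatedly with the same audio and past-frame context yields different latents, which is how a variational model of this kind can produce multiple plausible futures from one conditioning signal.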
