Paper Title

Speech Prediction in Silent Videos using Variational Autoencoders

Paper Authors

Ravindra Yadav, Ashish Sardana, Vinay P. Namboodiri, Rajesh M. Hegde

Paper Abstract

Understanding the relationship between auditory and visual signals is crucial for many different applications, ranging from computer-generated imagery (CGI) and video editing automation to assisting people with hearing or visual impairments. However, this is challenging since the distributions of both the audio and visual modalities are inherently multimodal. Therefore, most existing methods ignore the multimodal aspect and assume that there exists only a deterministic one-to-one mapping between the two modalities. This can lead to low-quality predictions, as the model collapses to optimizing the average behavior rather than learning the full data distribution. In this paper, we present a stochastic model for generating speech in a silent video. The proposed model combines recurrent neural networks and variational deep generative models to learn the conditional distribution of the auditory signal given the visual signal. We demonstrate the performance of our model on the GRID dataset using standard benchmarks.
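
A minimal sketch of the kind of architecture the abstract describes: a conditional VAE whose posterior, prior, and decoder are built around recurrent networks, learning the distribution of audio features conditioned on per-frame visual features. This is not the authors' implementation; the class name, layer choices, and dimensions (video_dim, audio_dim, hidden_dim, latent_dim) are illustrative assumptions.

```python
# Illustrative sketch only: a recurrent conditional VAE for video-to-speech.
# All names and sizes below are assumptions, not taken from the paper.
import torch
import torch.nn as nn

class VideoToSpeechCVAE(nn.Module):
    def __init__(self, video_dim=512, audio_dim=80, hidden_dim=256, latent_dim=64):
        super().__init__()
        # Recurrent encoder over per-frame visual features (the condition).
        self.video_rnn = nn.GRU(video_dim, hidden_dim, batch_first=True)
        # Posterior q(z | audio, video): sees both modalities at training time.
        self.audio_rnn = nn.GRU(audio_dim, hidden_dim, batch_first=True)
        self.post_mu = nn.Linear(2 * hidden_dim, latent_dim)
        self.post_logvar = nn.Linear(2 * hidden_dim, latent_dim)
        # Prior p(z | video): conditioned on the visual signal only.
        self.prior_mu = nn.Linear(hidden_dim, latent_dim)
        self.prior_logvar = nn.Linear(hidden_dim, latent_dim)
        # Recurrent decoder p(audio | z, video) emitting audio features
        # (e.g. mel-spectrogram frames) one time step at a time.
        self.decoder_rnn = nn.GRU(hidden_dim + latent_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, audio_dim)

    def forward(self, video_feats, audio_feats):
        # video_feats: (B, T, video_dim), audio_feats: (B, T, audio_dim)
        v_seq, v_last = self.video_rnn(video_feats)           # (B, T, H), (1, B, H)
        _, a_last = self.audio_rnn(audio_feats)               # (1, B, H)
        joint = torch.cat([v_last[-1], a_last[-1]], dim=-1)   # (B, 2H)

        # Posterior and prior parameters (diagonal Gaussians).
        mu_q, logvar_q = self.post_mu(joint), self.post_logvar(joint)
        mu_p, logvar_p = self.prior_mu(v_last[-1]), self.prior_logvar(v_last[-1])

        # Reparameterization trick: z ~ q(z | audio, video).
        z = mu_q + torch.randn_like(mu_q) * torch.exp(0.5 * logvar_q)

        # Decode: broadcast z across time, conditioned on the visual sequence.
        z_seq = z.unsqueeze(1).expand(-1, v_seq.size(1), -1)
        dec_out, _ = self.decoder_rnn(torch.cat([v_seq, z_seq], dim=-1))
        audio_pred = self.out(dec_out)

        # KL(q || p) between the two diagonal Gaussians, summed over latent dims.
        kl = 0.5 * torch.sum(
            logvar_p - logvar_q
            + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
            - 1.0,
            dim=-1,
        ).mean()
        recon = nn.functional.mse_loss(audio_pred, audio_feats)
        return audio_pred, recon + kl
```

At test time one would sample z from the learned prior p(z | video) rather than the posterior, which is what makes the mapping from video to speech stochastic instead of a deterministic one-to-one regression.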
