Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis

Paper Title

Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis

Authors

K R Prajwal, Rudrabha Mukhopadhyay, Vinay Namboodiri, C V Jawahar

Abstract

Humans involuntarily tend to infer parts of the conversation from lip movements when the speech is absent or corrupted by external noise. In this work, we explore the task of lip to speech synthesis, i.e., learning to generate natural speech given only the lip movements of a speaker. Acknowledging the importance of contextual and speaker-specific cues for accurate lip-reading, we take a different path from existing works. We focus on learning accurate lip sequences to speech mappings for individual speakers in unconstrained, large vocabulary settings. To this end, we collect and release a large-scale benchmark dataset, the first of its kind, specifically to train and evaluate the single-speaker lip to speech task in natural settings. We propose a novel approach with key design choices to achieve accurate, natural lip to speech synthesis in such unconstrained scenarios for the first time. Extensive evaluation using quantitative, qualitative metrics and human evaluation shows that our method is four times more intelligible than previous works in this space. Please check out our demo video for a quick overview of the paper, method, and qualitative results. https://www.youtube.com/watch?v=HziA-jmlk_4&feature=youtu.be
