Paper Title


FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis

Paper Authors

Yongqi Wang, Zhou Zhao

Paper Abstract


Unconstrained lip-to-speech synthesis aims to generate corresponding speech from silent videos of talking faces with no restriction on head poses or vocabulary. Current works mainly use sequence-to-sequence models to solve this problem, either in an autoregressive architecture or a flow-based non-autoregressive architecture. However, these models suffer from several drawbacks: 1) Instead of directly generating audio, they use a two-stage pipeline that first generates mel-spectrograms and then reconstructs audio from the spectrograms. This causes cumbersome deployment and degradation of speech quality due to error propagation; 2) The audio reconstruction algorithm used by these models limits the inference speed and audio quality, while neural vocoders are not available for these models since their output spectrograms are not accurate enough; 3) The autoregressive model suffers from high inference latency, while the flow-based model has high memory occupancy: neither of them is efficient enough in both time and memory usage. To tackle these problems, we propose FastLTS, a non-autoregressive end-to-end model which can directly synthesize high-quality speech audio from unconstrained talking videos with low latency, and has a relatively small model size. In addition, unlike the widely used 3D-CNN visual frontend for lip movement encoding, we are the first to propose a transformer-based visual frontend for this task. Experiments show that our model achieves a $19.76\times$ speedup for audio waveform generation compared with the current autoregressive model on input sequences of 3 seconds, and obtains superior audio quality.
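To make the "transformer-based visual frontend" idea concrete, below is a minimal, hypothetical sketch of how per-frame lip crops could be encoded with patch embeddings plus spatial and temporal transformer encoders instead of a 3D-CNN stem. All module names, dimensions, and the pooling scheme here are illustrative assumptions, not the actual FastLTS architecture described in the paper.

```python
# Hypothetical sketch of a transformer-based visual frontend for lip-to-speech.
# Dimensions and structure are illustrative assumptions only.
import torch
import torch.nn as nn


class TransformerVisualFrontend(nn.Module):
    """Encodes a sequence of cropped lip frames into per-frame features."""

    def __init__(self, img_size=96, patch_size=16, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        # Per-frame patch embedding (a ViT-style alternative to a 3D-CNN stem).
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=patch_size, stride=patch_size)
        n_patches = (img_size // patch_size) ** 2
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches, d_model))
        # Spatial transformer over patches within each frame.
        spatial_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.spatial_encoder = nn.TransformerEncoder(spatial_layer, num_layers=n_layers)
        # Temporal transformer over the frame sequence.
        temporal_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(temporal_layer, num_layers=n_layers)

    def forward(self, frames):
        # frames: (batch, time, 3, H, W)
        b, t, c, h, w = frames.shape
        x = self.patch_embed(frames.reshape(b * t, c, h, w))   # (b*t, d, h', w')
        x = x.flatten(2).transpose(1, 2) + self.pos_embed      # (b*t, patches, d)
        x = self.spatial_encoder(x).mean(dim=1)                # pooled per-frame feature
        x = x.reshape(b, t, -1)
        return self.temporal_encoder(x)                        # (b, t, d)


if __name__ == "__main__":
    model = TransformerVisualFrontend()
    video = torch.randn(2, 75, 3, 96, 96)  # 3 s of lip crops at 25 fps
    features = model(video)
    print(features.shape)                   # torch.Size([2, 75, 256])
```

In the non-autoregressive end-to-end setting the abstract describes, such per-frame features would then be upsampled directly to the waveform sampling rate and decoded in parallel, rather than first predicting a mel-spectrogram and running a separate reconstruction stage.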
