Paper title
Large-scale multilingual audio visual dubbing
Paper authors
Paper abstract
We describe a system for large-scale audiovisual translation and dubbing, which translates videos from one language to another. The source language's speech content is transcribed to text, translated, and automatically synthesized into target-language speech using the original speaker's voice. The visual content is translated by synthesizing lip movements for the speaker to match the translated audio, creating a seamless audiovisual experience in the target language. The audio and visual translation subsystems each contain a large-scale generic synthesis model trained on thousands of hours of data in the corresponding domain. These generic models are fine-tuned to a specific speaker before translation, either using an auxiliary corpus of data from the target speaker, or using the video being translated itself as the input to the fine-tuning process. This report gives an architectural overview of the full system, as well as an in-depth discussion of the video dubbing component. The role of the audio and text components in relation to the full system is outlined, but their design is not discussed in detail. Translated and dubbed demo videos generated using our system can be viewed at https://www.youtube.com/playlist?list=PLSi232j2ZA6_1Exhof5vndzyfbxAhhEs5
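The staged pipeline the abstract describes (speech recognition, text translation, speaker-adapted speech synthesis, then lip-motion resynthesis) can be sketched as below. This is a minimal illustrative stub, not the authors' implementation: every function name, model handle, and data format here is a hypothetical placeholder standing in for a large trained model.

```python
from dataclasses import dataclass

# Hypothetical stubs illustrating the stage ordering from the abstract.
# Real systems would replace each function with a large trained model;
# "ft-tts" / "ft-lips" stand in for generic models fine-tuned to one speaker.

@dataclass
class Video:
    frames: str  # placeholder for the visual track
    audio: str   # placeholder for the source-language audio track

def transcribe(audio: str) -> str:
    # ASR: source-language speech -> source-language text
    return f"text({audio})"

def translate(text: str, target_lang: str) -> str:
    # MT: source-language text -> target-language text
    return f"{target_lang}:{text}"

def synthesize_speech(text: str, speaker_model: str) -> str:
    # TTS with a generic model fine-tuned to the original speaker's voice
    return f"speech[{speaker_model}]({text})"

def synthesize_lips(frames: str, dubbed_audio: str, face_model: str) -> str:
    # Visual dubbing: regenerate lip movements to match the new audio
    return f"frames[{face_model}]({frames}|{dubbed_audio})"

def dub(video: Video, target_lang: str,
        speaker_model: str = "ft-tts", face_model: str = "ft-lips") -> Video:
    """Run the full audiovisual dubbing pipeline on one video."""
    text = transcribe(video.audio)
    translated = translate(text, target_lang)
    dubbed_audio = synthesize_speech(translated, speaker_model)
    dubbed_frames = synthesize_lips(video.frames, dubbed_audio, face_model)
    return Video(frames=dubbed_frames, audio=dubbed_audio)
```

Note that the fine-tuned speaker and face models are parameters of `dub`: per the abstract, they may be adapted either on an auxiliary corpus of the target speaker or on the very video being dubbed.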