Paper Title

vTTS: visual-text to speech

Paper Authors

Yoshifumi Nakano, Takaaki Saeki, Shinnosuke Takamichi, Katsuhito Sudoh, Hiroshi Saruwatari

Paper Abstract

This paper proposes visual-text to speech (vTTS), a method for synthesizing speech from visual text (i.e., text as an image). Conventional TTS converts phonemes or characters into discrete symbols and synthesizes a speech waveform from them, thus losing the visual features that the characters essentially have. Therefore, our method synthesizes speech not from discrete symbols but from visual text. The proposed vTTS extracts visual features with a convolutional neural network and then generates acoustic features with a non-autoregressive model inspired by FastSpeech2. Experimental results show that 1) vTTS is capable of generating speech with naturalness comparable to or better than a conventional TTS, 2) it can transfer emphasis and emotion attributes in visual text to speech without additional labels and architectures, and 3) it can synthesize more natural and intelligible speech from unseen and rare characters than conventional TTS.
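To make the described pipeline concrete, below is a minimal PyTorch sketch of the two stages named in the abstract: a CNN that turns a rendered text image into a sequence of visual features, followed by a non-autoregressive acoustic model that maps those features to a mel-spectrogram. All layer sizes, module names, and the uniform-upsampling stand-in for FastSpeech2's duration predictor are illustrative assumptions, not the authors' exact architecture; the random tensor stands in for an actual grayscale rendering of the input text.

```python
# Hypothetical sketch of a vTTS-style pipeline:
# visual text (image) -> CNN features -> non-autoregressive model -> mel.
import torch
import torch.nn as nn


class VisualFeatureExtractor(nn.Module):
    """CNN mapping a rendered text image to a sequence of visual features."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),             # pool along height only
            nn.Conv2d(64, feat_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),  # collapse height, keep width
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (batch, 1, height, width) grayscale rendering of the text
        h = self.conv(image)                  # (batch, feat_dim, 1, width)
        return h.squeeze(2).transpose(1, 2)   # (batch, width, feat_dim)


class NonAutoregressiveAcousticModel(nn.Module):
    """FastSpeech2-inspired model, reduced here to Transformer blocks plus
    uniform upsampling as a stand-in for per-token duration prediction."""

    def __init__(self, feat_dim: int = 256, n_mels: int = 80, upsample: int = 4):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(feat_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.upsample = upsample  # FastSpeech2 predicts durations instead
        dec_layer = nn.TransformerEncoderLayer(feat_dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=2)
        self.to_mel = nn.Linear(feat_dim, n_mels)

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        h = self.encoder(visual_feats)
        # Repeat each position uniformly; a real variance adaptor would
        # expand by predicted durations and add pitch/energy.
        h = h.repeat_interleave(self.upsample, dim=1)
        return self.to_mel(self.decoder(h))   # (batch, frames, n_mels)


if __name__ == "__main__":
    image = torch.rand(1, 1, 32, 128)  # stand-in for a 32x128 text rendering
    feats = VisualFeatureExtractor()(image)
    mel = NonAutoregressiveAcousticModel()(feats)
    print(mel.shape)                   # torch.Size([1, 512, 80])
```

The design point the abstract makes is visible in this structure: because the model consumes pixels rather than discrete phoneme or character IDs, typographic cues such as bold or stylized glyphs survive into the visual features, which is what allows emphasis and emotion attributes to transfer to speech without extra labels or architectural changes.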
