GraphTTS: Graph-to-Sequence Modelling in Neural Text-to-Speech

Paper title

GraphTTS: graph-to-sequence modelling in neural text-to-speech

Authors

Aolan Sun, Jianzong Wang, Ning Cheng, Huayi Peng, Zhen Zeng, Jing Xiao

Abstract

This paper applies a graph-to-sequence method to neural text-to-speech. The resulting model, GraphTTS, maps a graph embedding of the input sequence to spectrograms. The graph inputs consist of node and edge representations constructed from the input text, and a GNN encoder module encodes them so that syntactic information is incorporated. In addition, the GraphTTS encoder can serve as a graph auxiliary encoder (GAE) that analyses prosody information from the semantic structure of the text. This removes the need to manually select reference audio and makes prosody modelling an end-to-end procedure. Experimental analysis shows that GraphTTS outperforms state-of-the-art sequence-to-sequence models by 0.24 in Mean Opinion Score (MOS). The GAE can automatically adjust the pauses, breaths (ventilation) and tones of the synthesised audio. These findings may offer some inspiration to researchers working on improving prosody in speech synthesis.
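The abstract describes graph inputs made of nodes and edges built from the input text, which a GNN encoder then encodes. The following is only a minimal sketch of that general idea, not the authors' implementation. It assumes one node per character and forward/backward edges between adjacent characters, and it uses plain mean aggregation in place of a learned GNN layer. The names build_text_graph and message_passing_step are hypothetical, and the paper's actual graph construction and encoder may differ.

```python
from typing import List, Tuple

Edge = Tuple[int, int, str]  # (source index, target index, edge type)


def build_text_graph(text: str) -> Tuple[List[str], List[Edge]]:
    """Turn a text into nodes (one per character) and typed edges.

    Assumption: adjacent characters are linked in both directions.
    The paper's actual edge construction may differ.
    """
    nodes = list(text)
    edges: List[Edge] = []
    for i in range(len(nodes) - 1):
        edges.append((i, i + 1, "forward"))
        edges.append((i + 1, i, "backward"))
    return nodes, edges


def message_passing_step(
    features: List[List[float]], edges: List[Edge]
) -> List[List[float]]:
    """One round of mean aggregation.

    Each node averages its own feature vector with those of its
    in-neighbours. A real GNN layer would apply learned weights and
    could treat edge types differently; this sketch ignores both.
    """
    dim = len(features[0]) if features else 0
    sums = [list(vec) for vec in features]  # start from each node's own features
    counts = [1] * len(features)
    for src, dst, _edge_type in edges:
        for d in range(dim):
            sums[dst][d] += features[src][d]
        counts[dst] += 1
    return [[value / counts[i] for value in vec] for i, vec in enumerate(sums)]


nodes, edges = build_text_graph("abc")
# Toy one-dimensional features standing in for learned node embeddings.
features = [[0.0], [3.0], [6.0]]
updated = message_passing_step(features, edges)
```

After one step, each node's feature reflects its neighbours as well as itself, which is the basic mechanism by which a graph encoder can propagate structural information before a decoder maps the encoded nodes to spectrogram frames.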
