Paper Title
Prosodic Representation Learning and Contextual Sampling for Neural Text-to-Speech
Paper Authors
Paper Abstract
In this paper, we introduce Kathaka, a model trained with a novel two-stage training process for neural speech synthesis with contextually appropriate prosody. In Stage I, we learn a prosodic distribution at the sentence level from mel-spectrograms available during training. In Stage II, we propose a novel method to sample from this learnt prosodic distribution using the contextual information available in text. To do this, we use BERT on text, and graph-attention networks on parse trees extracted from text. We show a statistically significant relative improvement of $13.2\%$ in naturalness over a strong baseline when compared to recordings. We also conduct an ablation study on variations of our sampling technique, and show a statistically significant improvement over the baseline in each case.
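The Stage II pipeline described above (a sentence-level embedding from BERT, a graph-attention network over the parse tree, and a sample drawn from the predicted prosodic distribution) can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: the random vector `sent_emb` stands in for a BERT sentence embedding, the single-head attention layer is a simplified graph-attention (GAT-style) update, and all weight matrices and dimensions are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def graph_attention(node_feats, adj, W, a):
    """One simplified single-head graph-attention layer:
    score each connected pair, softmax over neighbours, weighted sum."""
    h = node_feats @ W                       # project node features, (N, d)
    n = h.shape[0]
    scores = np.full((n, n), -np.inf)        # -inf masks non-edges
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                scores[i, j] = np.tanh(np.concatenate([h[i], h[j]]) @ a)
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)  # normalise over neighbours
    return attn @ h                          # attention-weighted aggregation

def sample_prosody(sent_emb, tree_emb, W_mu, W_logvar):
    """Predict a Gaussian over sentence-level prosody from the
    contextual embeddings, then draw one sample (reparameterised)."""
    ctx = np.concatenate([sent_emb, tree_emb])
    mu, logvar = ctx @ W_mu, ctx @ W_logvar
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)

# Toy 4-node parse tree: chain edges plus self-loops (hypothetical example).
adj = np.eye(4, dtype=bool)
for i in range(3):
    adj[i, i + 1] = adj[i + 1, i] = True

node_feats = rng.standard_normal((4, 8))       # parse-tree node features
W = 0.1 * rng.standard_normal((8, 8))          # GAT projection weights
a = 0.1 * rng.standard_normal(16)              # GAT attention vector
tree_emb = graph_attention(node_feats, adj, W, a).mean(axis=0)

sent_emb = rng.standard_normal(16)             # stand-in for a BERT embedding
W_mu = 0.1 * rng.standard_normal((24, 4))      # context -> prosody mean
W_logvar = 0.1 * rng.standard_normal((24, 4))  # context -> prosody log-variance
z = sample_prosody(sent_emb, tree_emb, W_mu, W_logvar)  # prosody sample, (4,)
```

In the paper's setting, a sample like `z` would condition the synthesis model so that prosody varies with textual context rather than collapsing to an average style.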