Paper Title
Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling
Paper Authors
Paper Abstract
This paper presents Non-Attentive Tacotron based on the Tacotron 2 text-to-speech model, replacing the attention mechanism with an explicit duration predictor. This improves robustness significantly as measured by unaligned duration ratio and word deletion rate, two metrics introduced in this paper for large-scale robustness evaluation using a pre-trained speech recognition model. With the use of Gaussian upsampling, Non-Attentive Tacotron achieves a 5-scale mean opinion score for naturalness of 4.41, slightly outperforming Tacotron 2. The duration predictor enables both utterance-wide and per-phoneme control of duration at inference time. When accurate target durations are scarce or unavailable in the training data, we propose a method using a fine-grained variational auto-encoder to train the duration predictor in a semi-supervised or unsupervised manner, with results almost as good as supervised training.
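The abstract's Gaussian upsampling step can be sketched as follows: each output frame is a weighted sum of phoneme encodings, with weights drawn from a Gaussian centered at each phoneme's temporal midpoint (its cumulative end time minus half its own duration) and scaled by a per-phoneme range parameter. This is a minimal NumPy illustration of the idea, not the paper's exact implementation; the function name and array shapes are assumptions.

```python
import numpy as np

def gaussian_upsample(encodings, durations, ranges):
    """Illustrative sketch of Gaussian upsampling.

    encodings: (N, dim) phoneme encoder outputs
    durations: (N,) predicted durations in frames
    ranges:    (N,) predicted Gaussian range (std dev) per phoneme
    Returns:   (T, dim) frame-level features, T = round(sum(durations))
    """
    ends = np.cumsum(durations)              # cumulative end frame of each phoneme
    centers = ends - durations / 2.0         # Gaussian center c_i per phoneme
    total = int(round(ends[-1]))             # total number of output frames T
    t = np.arange(total)[:, None] + 0.5      # frame midpoints, shape (T, 1)
    # Unnormalized Gaussian weights over phonemes, shape (T, N)
    logits = -0.5 * ((t - centers[None, :]) / ranges[None, :]) ** 2
    w = np.exp(logits)
    w = w / w.sum(axis=1, keepdims=True)     # normalize across phonemes per frame
    return w @ encodings                     # weighted sum -> (T, dim)
```

Because the weights are soft rather than a hard repeat of each encoding, the operation stays differentiable in the durations, which is what lets the duration predictor be trained jointly with the rest of the model.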