Period VITS: Variational Inference with Explicit Pitch Modeling for End-to-end Emotional Speech Synthesis

Paper title

Period VITS: Variational Inference with Explicit Pitch Modeling for End-to-end Emotional Speech Synthesis

Authors

Shirahata, Yuma; Yamamoto, Ryuichi; Song, Eunwoo; Terashima, Ryo; Kim, Jae-Min; Tachibana, Kentaro

Abstract

Several fully end-to-end text-to-speech (TTS) models have been proposed that have shown better performance compared to cascade models (i.e., training acoustic and vocoder models separately). However, they often generate unstable pitch contour with audible artifacts when the dataset contains emotional attributes, i.e., large diversity of pronunciation and prosody. To address this problem, we propose Period VITS, a novel end-to-end TTS model that incorporates an explicit periodicity generator. In the proposed method, we introduce a frame pitch predictor that predicts prosodic features, such as pitch and voicing flags, from the input text. From these features, the proposed periodicity generator produces a sample-level sinusoidal source that enables the waveform decoder to accurately reproduce the pitch. Finally, the entire model is jointly optimized in an end-to-end manner with variational inference and adversarial objectives. As a result, the decoder becomes capable of generating more stable, expressive, and natural output waveforms. The experimental results showed that the proposed model significantly outperforms baseline models in terms of naturalness, with improved pitch stability in the generated samples.
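The abstract describes a periodicity generator that turns frame-level pitch and voicing flags into a sample-level sinusoidal source. The following is only a minimal NumPy sketch of that general idea, not the authors' implementation: the function name `sinusoidal_source`, the nearest-neighbour upsampling, the hop length, the sample rate, and the small additive noise are all assumptions made for illustration.

```python
import numpy as np


def sinusoidal_source(f0_frames, voiced_flags, hop_length=256,
                      sample_rate=24000, noise_std=0.003, seed=0):
    """Build a sample-level sine excitation from frame-level F0 and voicing flags.

    Hypothetical helper for illustration; all default values are assumptions.
    """
    f0 = np.asarray(f0_frames, dtype=np.float64)
    voiced = np.asarray(voiced_flags, dtype=bool)

    # Treat unvoiced frames as having no fundamental frequency.
    f0 = np.where(voiced, f0, 0.0)

    # Upsample frame-level values to sample level by repeating each frame.
    f0_samples = np.repeat(f0, hop_length)
    voiced_samples = np.repeat(voiced, hop_length)

    # Instantaneous phase is the running sum of per-sample phase increments.
    phase = 2.0 * np.pi * np.cumsum(f0_samples / sample_rate)

    # Sine in voiced regions, silence elsewhere, plus low-level Gaussian noise.
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, noise_std, size=f0_samples.shape)
    return np.where(voiced_samples, np.sin(phase), 0.0) + noise


# Example: two voiced frames at 100 Hz followed by one unvoiced frame.
source = sinusoidal_source([100.0, 100.0, 0.0], [1, 1, 0],
                           hop_length=80, sample_rate=8000)
```

In a model of the kind the abstract describes, a signal like `source` would be fed to the waveform decoder as an explicit pitch cue; how the decoder consumes it is not specified here and is left out of the sketch.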
