Speaking Speed Control of End-to-End Speech Synthesis using Sentence-Level Conditioning

Paper Title

Speaking Speed Control of End-to-End Speech Synthesis using Sentence-Level Conditioning

Authors

Bae, Jae-Sung, Bae, Hanbin, Joo, Young-Sun, Lee, Junmo, Lee, Gyeong-Hoon, Cho, Hoon-Young

Abstract

This paper proposes a controllable end-to-end text-to-speech (TTS) system to control the speaking speed (speed-controllable TTS; SCTTS) of synthesized speech with sentence-level speaking-rate value as an additional input. The speaking-rate value, the ratio of the number of input phonemes to the length of input speech, is adopted in the proposed system to control the speaking speed. Furthermore, the proposed SCTTS system can control the speaking speed while retaining other speech attributes, such as the pitch, by adopting the global style token-based style encoder. The proposed SCTTS does not require any additional well-trained model or an external speech database to extract phoneme-level duration information and can be trained in an end-to-end manner. In addition, our listening tests on fast-, normal-, and slow-speed speech showed that the SCTTS can generate more natural speech than other phoneme duration control approaches which increase or decrease duration at the same rate for the entire sentence, especially in the case of slow-speed speech.
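
The speaking-rate value described in the abstract (number of input phonemes divided by the length of the input speech) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the choice of seconds as the length unit, and the toy durations are all assumptions made for illustration. The second helper mimics the kind of baseline the abstract contrasts against, where every phoneme duration is scaled by the same factor across the sentence.

```python
def speaking_rate(num_phonemes: int, speech_seconds: float) -> float:
    """Sentence-level speaking rate: phonemes per unit of speech length.

    Assumption: length is measured in seconds; the abstract does not fix a unit.
    """
    if speech_seconds <= 0:
        raise ValueError("speech_seconds must be positive")
    return num_phonemes / speech_seconds


def scale_durations_uniformly(durations, factor):
    """Baseline-style control: stretch every phoneme duration by the same factor."""
    return [d * factor for d in durations]


# Toy example with made-up phoneme durations (seconds).
phonemes = ["HH", "AH", "L", "OW"]
durations = [0.08, 0.10, 0.07, 0.15]

rate = speaking_rate(len(phonemes), sum(durations))

# Doubling every duration halves the sentence-level speaking rate.
slow_durations = scale_durations_uniformly(durations, 2.0)
slow_rate = speaking_rate(len(phonemes), sum(slow_durations))

print(f"normal: {rate:.2f} phonemes/s, slowed: {slow_rate:.2f} phonemes/s")
```

In the system the abstract describes, a single value like `rate` is supplied as a sentence-level conditioning input, so no phoneme-level duration labels are needed; the per-phoneme durations above exist only to make the arithmetic concrete.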
