Paper Title

Knowledge-and-Data-Driven Amplitude Spectrum Prediction for Hierarchical Neural Vocoders

Paper Authors

Ai, Yang; Ling, Zhen-Hua

Abstract

In our previous work, we have proposed a neural vocoder called HiNet which recovers speech waveforms by predicting amplitude and phase spectra hierarchically from input acoustic features. In HiNet, the amplitude spectrum predictor (ASP) predicts log amplitude spectra (LAS) from input acoustic features. This paper proposes a novel knowledge-and-data-driven ASP (KDD-ASP) to improve the conventional one. First, acoustic features (i.e., F0 and mel-cepstra) pass through a knowledge-driven LAS recovery module to obtain approximate LAS (ALAS). This module is designed based on the combination of STFT and source-filter theory, in which the source part and the filter part are designed based on input F0 and mel-cepstra, respectively. Then, the recovered ALAS are processed by a data-driven LAS refinement module which consists of multiple trainable convolutional layers to get the final LAS. Experimental results show that the HiNet vocoder using KDD-ASP can achieve higher quality of synthetic speech than that using conventional ASP and the WaveRNN vocoder on a text-to-speech (TTS) task.
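
The abstract describes KDD-ASP as two stages: a knowledge-driven module that builds an approximate LAS (ALAS) from F0 and mel-cepstra via source-filter theory and the STFT, followed by a data-driven refinement with trainable convolutional layers. The sketch below illustrates this two-stage idea under stated assumptions; the frame/FFT sizes, the windowed impulse-train source, the simplified cepstrum-to-spectrum expansion (mel warping ignored), and the refinement layer sizes are all illustrative choices, not the authors' exact implementation.

```python
# Minimal sketch of the two KDD-ASP stages described in the abstract, assuming
# hypothetical frame-level inputs `f0_hz` (one F0 value per voiced frame) and
# `mcep` (one mel-cepstral vector per frame). All constants are assumptions.
import numpy as np
import torch
import torch.nn as nn

FS = 16000       # sampling rate (Hz), assumed
N_FFT = 1024     # FFT size, assumed
FRAME_LEN = 640  # analysis window length in samples, assumed


def knowledge_driven_alas(f0_hz: float, mcep: np.ndarray) -> np.ndarray:
    """Approximate log amplitude spectrum (ALAS) for one voiced frame:
    log|source| + log|filter|, following source-filter theory."""
    # --- source part: windowed impulse train at F0, then FFT magnitude ---
    period = int(round(FS / max(f0_hz, 1.0)))
    excitation = np.zeros(FRAME_LEN)
    excitation[::period] = 1.0
    windowed = excitation * np.hanning(FRAME_LEN)
    log_source = np.log(np.abs(np.fft.rfft(windowed, N_FFT)) + 1e-8)

    # --- filter part: cepstrum-to-spectrum cosine expansion (warping ignored) ---
    omega = np.pi * np.arange(N_FFT // 2 + 1) / (N_FFT // 2)   # (n_bins,)
    m = np.arange(1, len(mcep))[:, None]                        # (order-1, 1)
    log_filter = mcep[0] + 2.0 * (mcep[1:, None] * np.cos(m * omega)).sum(axis=0)

    # combining in the log domain = multiplying source and filter spectra
    return log_source + log_filter


class LASRefinement(nn.Module):
    """Data-driven refinement: a small stack of trainable conv layers mapping
    the ALAS frame sequence to the final LAS (layer sizes are assumptions)."""

    def __init__(self, n_bins: int = N_FFT // 2 + 1, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_bins, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, n_bins, kernel_size=5, padding=2),
        )

    def forward(self, alas: torch.Tensor) -> torch.Tensor:
        # alas: (batch, n_frames, n_bins); convolve along the frame (time) axis
        return self.net(alas.transpose(1, 2)).transpose(1, 2)
```

For example, stacking `knowledge_driven_alas(...)` over all frames of an utterance and feeding the resulting `(1, n_frames, 513)` tensor through `LASRefinement` yields the refined LAS; in the paper this refined LAS (together with the phase spectrum predictor) drives waveform recovery in the HiNet vocoder.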
