Learning and controlling the source-filter representation of speech with a variational autoencoder

Paper title

Learning and controlling the source-filter representation of speech with a variational autoencoder

Authors

Samir Sadok, Simon Leglaive, Laurent Girin, Xavier Alameda-Pineda, Renaud Séguier

Abstract

Understanding and controlling latent representations in deep generative models is a challenging yet important problem for analyzing, transforming and generating various types of data. In speech processing, inspired by the anatomical mechanisms of phonation, the source-filter model considers that speech signals are produced from a few independent and physically meaningful continuous latent factors, among which the fundamental frequency $f_0$ and the formants are of primary importance. In this work, we start from a variational autoencoder (VAE) trained in an unsupervised manner on a large dataset of unlabeled natural speech signals, and we show that the source-filter model of speech production naturally arises as orthogonal subspaces of the VAE latent space. Using only a few seconds of labeled speech signals generated with an artificial speech synthesizer, we propose a method to identify the latent subspaces encoding $f_0$ and the first three formant frequencies, and we show that these subspaces are orthogonal. Based on this orthogonality, we develop a method to accurately and independently control the source-filter speech factors within the latent subspaces. Without requiring additional information such as text or human-labeled data, this results in a deep generative model of speech spectrograms that is conditioned on $f_0$ and the formant frequencies, and which is applied to the transformation of speech signals. Finally, we also propose a robust $f_0$ estimation method that exploits the projection of a speech signal onto the learned latent subspace associated with $f_0$.
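
To make the idea of orthogonal latent subspaces concrete, the following is a minimal sketch on toy data, not the authors' implementation. It assumes each factor's subspace is estimated with PCA on latent vectors in which only that factor varies; the abstract does not say how the subspaces are identified, so PCA is an assumption here. The function names (`fit_subspace`, `subspace_overlap`, `move_in_subspace`) and the synthetic "latents" are hypothetical and stand in for the encodings of synthesized speech.

```python
import numpy as np

def fit_subspace(latents, n_components):
    """Estimate a subspace by PCA (an assumption, see lead-in).

    Returns the data mean and an orthonormal basis of shape (dim, n_components).
    """
    mean = latents.mean(axis=0)
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(latents - mean, full_matrices=False)
    return mean, vt[:n_components].T

def subspace_overlap(basis_a, basis_b):
    """Largest cosine of the principal angles; 0 means orthogonal subspaces."""
    return np.linalg.svd(basis_a.T @ basis_b, compute_uv=False).max()

def move_in_subspace(z, mean, basis, target_coords):
    """Set the coordinates of z inside the subspace to target_coords.

    The component of z orthogonal to the subspace is left unchanged.
    """
    centered = z - mean
    inside = basis @ (basis.T @ centered)
    return mean + centered - inside + basis @ target_coords

# Toy stand-in for latent vectors: one factor varies along axis 0,
# another along axis 1, plus small isotropic noise.
rng = np.random.default_rng(0)
dim, n = 16, 200
t = np.linspace(-2.0, 2.0, n)

latents_f0 = 0.01 * rng.normal(size=(n, dim))
latents_f0[:, 0] += t  # hypothetical "f0" factor
latents_f1 = 0.01 * rng.normal(size=(n, dim))
latents_f1[:, 1] += t  # hypothetical "first formant" factor

mean_f0, basis_f0 = fit_subspace(latents_f0, n_components=1)
mean_f1, basis_f1 = fit_subspace(latents_f1, n_components=1)
overlap = subspace_overlap(basis_f0, basis_f1)

# Move an arbitrary latent vector to a target coordinate in the "f0" subspace.
z = rng.normal(size=dim)
target = np.array([1.5])
z_new = move_in_subspace(z, mean_f0, basis_f0, target)
```

When the two subspaces are close to orthogonal, changing the coordinate in one leaves the projection onto the other nearly unchanged, which is the property that allows the factors to be controlled independently. In a real system the modified latent vector would then be passed through the VAE decoder to obtain a transformed spectrogram.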
