Title
Learning to Compute the Articulatory Representations of Speech with the MIRRORNET
Authors
Abstract
Most organisms, including humans, function by coordinating and integrating sensory signals with motor actions to survive and accomplish desired tasks. Learning these complex sensorimotor mappings proceeds simultaneously and often in an unsupervised or semi-supervised fashion. In this work, an autoencoder architecture (MirrorNet) inspired by this sensorimotor learning paradigm is explored to control an articulatory synthesizer with minimal exposure to ground-truth articulatory data. The articulatory synthesizer takes as input a set of six vocal tract variables (TVs) and source features (voicing indicators and pitch), and is able to synthesize continuous speech for unseen speakers. We show that the MirrorNet, once initialized with ~30 minutes of articulatory data and then trained in an unsupervised fashion (the 'learning phase'), can learn meaningful articulatory representations with accuracy comparable to that of articulatory speech-inversion systems trained in a fully supervised fashion.