Paper Title
Real-Time MRI Video Synthesis from Time-Aligned Phonemes with Sequence-to-Sequence Networks
Paper Authors
Abstract
Real-time magnetic resonance imaging (rtMRI) of the midsagittal plane of the mouth is of interest for speech production research. In this work, we focus on estimating utterance-level rtMRI video from the spoken phoneme sequence. We use forced alignment to obtain time-aligned phonemes, which yield frame-level phoneme sequences aligned with the rtMRI frames. We propose a sequence-to-sequence learning model with a transformer phoneme encoder and a convolutional frame decoder. We then modify the learning by using intermediary features obtained by sampling from a pretrained phoneme-conditioned variational autoencoder (CVAE). We train on 8 subjects in a subject-specific manner and demonstrate the performance with a subjective test. We also use an auxiliary task of air-tissue boundary (ATB) segmentation to obtain objective scores for the proposed models. We show that the proposed method is able to generate realistic rtMRI video for unseen utterances, and that adding the CVAE is beneficial for learning the sequence-to-sequence mapping for subjects where the mapping is hard to learn.
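One concrete preprocessing step the abstract describes is expanding forced-alignment output into a frame-level phoneme sequence aligned with the rtMRI frames. A minimal sketch in Python, assuming a simple `(phoneme, start_sec, end_sec)` interval format and a fixed video frame rate (both hypothetical; the paper's actual alignment format and frame rate are not given here):

```python
def phonemes_to_frames(intervals, fps, n_frames):
    """Map each video frame to the phoneme whose time interval
    covers the frame's center time.

    intervals: list of (phoneme, start_sec, end_sec), sorted and contiguous
    fps:       video frame rate in frames per second (assumed fixed)
    n_frames:  number of rtMRI frames in the utterance
    """
    frame_phonemes = []
    for i in range(n_frames):
        t = (i + 0.5) / fps  # center time of frame i
        # Find the interval covering t; fall back to the last phoneme
        # if t runs past the final interval (e.g. trailing silence).
        label = intervals[-1][0]
        for phoneme, start, end in intervals:
            if start <= t < end:
                label = phoneme
                break
        frame_phonemes.append(label)
    return frame_phonemes

# Example: two phonemes over 0.3 s, video at 10 fps -> 3 frames
print(phonemes_to_frames([("AH", 0.0, 0.1), ("B", 0.1, 0.3)], 10, 3))
# -> ['AH', 'B', 'B']
```

The resulting per-frame phoneme labels are what a sequence-to-sequence model of this kind would consume as encoder input, one token per output video frame.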