Paper Title

Style Transfer for Co-Speech Gesture Animation: A Multi-Speaker Conditional-Mixture Approach

Paper Authors

Chaitanya Ahuja, Dong Won Lee, Yukiko I. Nakano, Louis-Philippe Morency

Paper Abstract

How can we teach robots or virtual assistants to gesture naturally? Can we go further and adapt the gesturing style to follow a specific speaker? Gestures that are naturally timed with corresponding speech during human communication are called co-speech gestures. A key challenge, called gesture style transfer, is to learn a model that generates these gestures for a speaking agent 'A' in the gesturing style of a target speaker 'B'. A secondary goal is to simultaneously learn to generate co-speech gestures for multiple speakers while remembering what is unique about each speaker. We call this challenge style preservation. In this paper, we propose a new model, named Mix-StAGE, which trains a single model for multiple speakers while learning unique style embeddings for each speaker's gestures in an end-to-end manner. A novelty of Mix-StAGE is to learn a mixture of generative models which allows for conditioning on the unique gesture style of each speaker. As Mix-StAGE disentangles style and content of gestures, gesturing styles for the same input speech can be altered by simply switching the style embeddings. Mix-StAGE also allows for style preservation when learning simultaneously from multiple speakers. We also introduce a new dataset, Pose-Audio-Transcript-Style (PATS), designed to study gesture generation and style transfer. Our proposed Mix-StAGE model significantly outperforms the previous state-of-the-art approach for gesture generation and provides a path towards performing gesture style transfer across multiple speakers. Link to code, data, and videos: http://chahuja.com/mix-stage
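To make the conditional-mixture idea concrete, the sketch below illustrates one way a style-conditioned mixture of generators could be wired up: each speaker has a learned style embedding, the embedding gates a small bank of sub-generators applied to the encoded speech, and swapping the speaker index swaps the gesture style for the same input speech. This is a minimal illustration based only on the abstract, not the authors' released implementation; all class names, layer choices, and dimensions (e.g. MixStageSketch, pose_dim=104, num_mixtures=8) are assumptions.

```python
import torch
import torch.nn as nn


class MixStageSketch(nn.Module):
    """Illustrative style-conditioned mixture of generators.

    Loosely follows the abstract's description: a shared content encoder over
    speech features, per-speaker style embeddings, and a bank of sub-generators
    whose outputs are mixed according to the speaker's style.
    """

    def __init__(self, num_speakers, audio_dim=128, hidden_dim=256,
                 pose_dim=104, style_dim=32, num_mixtures=8):
        super().__init__()
        # One trainable style embedding per speaker (learned end to end).
        self.style = nn.Embedding(num_speakers, style_dim)
        # Shared "content" encoder over the input speech features.
        self.audio_enc = nn.Sequential(
            nn.Conv1d(audio_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Bank of sub-generators; the style decides how to mix their outputs.
        self.generators = nn.ModuleList(
            [nn.Conv1d(hidden_dim, pose_dim, kernel_size=1)
             for _ in range(num_mixtures)]
        )
        self.gate = nn.Linear(style_dim, num_mixtures)

    def forward(self, audio, speaker_id):
        # audio: (batch, audio_dim, time); speaker_id: (batch,)
        content = self.audio_enc(audio)                          # (B, H, T)
        weights = torch.softmax(self.gate(self.style(speaker_id)), dim=-1)
        poses = torch.stack([g(content) for g in self.generators], dim=1)
        # Weighted sum over the mixture dimension -> (B, pose_dim, T)
        return (weights[:, :, None, None] * poses).sum(dim=1)


# Style transfer: drive speaker A's speech with speaker B's style embedding.
model = MixStageSketch(num_speakers=25)
speech_from_a = torch.randn(1, 128, 64)       # hypothetical speech features
style_id_of_b = torch.tensor([7])             # target speaker's style index
gestures_in_style_b = model(speech_from_a, style_id_of_b)  # (1, 104, 64)
```

In this sketch, changing speaker_id changes only the style embedding and therefore the mixture weights, while the encoded speech content is left untouched, mirroring the abstract's claim that gesture style can be altered for the same input speech by switching style embeddings.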
