Paper Title
Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation
Paper Authors
Paper Abstract
Generating speech-consistent body and gesture movements is a long-standing problem in virtual avatar creation. Previous studies often synthesize pose movement in a holistic manner, where the poses of all joints are generated simultaneously. Such a straightforward pipeline fails to generate fine-grained co-speech gestures. One observation is that the hierarchical semantics in speech and the hierarchical structure of human gestures can be naturally described at multiple granularities and associated with one another. To fully utilize the rich connections between speech audio and human gestures, we propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation. In HA2G, a Hierarchical Audio Learner extracts audio representations across semantic granularities. A Hierarchical Pose Inferer then progressively renders the entire human pose in a hierarchical manner. To enhance the quality of the synthesized gestures, we develop a contrastive learning strategy based on audio-text alignment to obtain better audio representations. Extensive experiments and human evaluation demonstrate that the proposed method renders realistic co-speech gestures and outperforms previous methods by a clear margin. Project page: https://alvinliu0.github.io/projects/HA2G
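To make the described pipeline concrete, below is a minimal PyTorch sketch of the three components named in the abstract: a hierarchical audio learner that keeps features at several granularities, a coarse-to-fine pose inferer, and an InfoNCE-style audio-text contrastive loss. The layer choices (stacked GRUs, linear pose heads), dimensions, and joint splits are illustrative assumptions for exposition, not the authors' implementation.

# Minimal sketch of the hierarchical audio-to-gesture idea, assuming PyTorch.
# All module internals, dimensions, and the coarse-to-fine joint split are
# illustrative assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalAudioLearner(nn.Module):
    """Extracts audio features at several semantic granularities by
    stacking GRU layers and keeping each layer's output."""
    def __init__(self, audio_dim=64, hidden=128, levels=3):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.GRU(audio_dim if i == 0 else hidden, hidden, batch_first=True)
            for i in range(levels)
        )

    def forward(self, audio):                  # audio: (B, T, audio_dim)
        feats, x = [], audio
        for gru in self.layers:
            x, _ = gru(x)                      # (B, T, hidden)
            feats.append(x)
        return feats                           # coarse -> fine feature list

class HierarchicalPoseInferer(nn.Module):
    """Decodes the pose coarse-to-fine: each level predicts a
    progressively larger subset of joints (e.g. body -> arms -> fingers)."""
    def __init__(self, hidden=128, joint_dims=(10 * 3, 24 * 3, 42 * 3)):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(hidden, d) for d in joint_dims)

    def forward(self, feats):
        # One pose head per granularity level of the audio features.
        return [head(f) for head, f in zip(self.heads, feats)]

def audio_text_contrastive(audio_emb, text_emb, temperature=0.07):
    """InfoNCE-style loss pulling matched audio/text clip embeddings
    together and pushing mismatched pairs apart."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature             # (B, B) similarity matrix
    labels = torch.arange(a.size(0))           # diagonal = matched pairs
    return F.cross_entropy(logits, labels)

# Toy usage with random tensors (4 clips, 100 audio frames each).
audio = torch.randn(4, 100, 64)
feats = HierarchicalAudioLearner()(audio)
poses = HierarchicalPoseInferer()(feats)
loss = audio_text_contrastive(feats[-1].mean(dim=1), torch.randn(4, 128))
print([p.shape for p in poses], loss.item())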