Paper Title
AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars
Paper Authors
Paper Abstract
3D avatar creation plays a crucial role in the digital age. However, the whole production process is prohibitively time-consuming and labor-intensive. To democratize this technology to a larger audience, we propose AvatarCLIP, a zero-shot text-driven framework for 3D avatar generation and animation. Unlike professional software that requires expert knowledge, AvatarCLIP empowers layman users to customize a 3D avatar with the desired shape and texture, and to drive the avatar with the described motions, using solely natural language. Our key insight is to take advantage of the powerful vision-language model CLIP to supervise neural human generation, in terms of 3D geometry, texture, and animation. Specifically, driven by natural language descriptions, we initialize 3D human geometry generation with a shape VAE network. Based on the generated 3D human shapes, a volume rendering model is utilized to further facilitate geometry sculpting and texture generation. Moreover, by leveraging the priors learned in a motion VAE, a CLIP-guided reference-based motion synthesis method is proposed for the animation of the generated 3D avatar. Extensive qualitative and quantitative experiments validate the effectiveness and generalizability of AvatarCLIP on a wide range of avatars. Remarkably, AvatarCLIP can generate unseen 3D avatars with novel animations, achieving superior zero-shot capability.
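At its core, this kind of CLIP supervision optimizes a differentiable representation so that its rendered views match a text prompt in CLIP's embedding space. The following is a minimal sketch of that loop using the OpenAI `clip` package; note that the learnable image tensor is only a hypothetical stand-in for the paper's differentiable avatar renderer, and the prompt and hyperparameters are illustrative assumptions, not the authors' settings.

```python
# Minimal sketch of CLIP-guided text-driven optimization.
# Assumes: pip install torch, plus the OpenAI CLIP package (github.com/openai/CLIP).
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model = model.float()  # keep weights in fp32 so gradients flow through a fp32 input

# Hypothetical stand-in for a rendered avatar view: a learnable 224x224 RGB image.
# AvatarCLIP itself would instead render the implicit avatar with volume rendering.
image = torch.rand(1, 3, 224, 224, device=device, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)

# Encode the driving text prompt once; it stays fixed during optimization.
tokens = clip.tokenize(["a 3D rendering of a tall and muscular man"]).to(device)
with torch.no_grad():
    text_feat = model.encode_text(tokens)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

for step in range(200):
    optimizer.zero_grad()
    # A real pipeline would also apply CLIP's input normalization here.
    img_feat = model.encode_image(image.clamp(0, 1))
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    # Maximize cosine similarity between the rendered view and the text prompt.
    loss = 1.0 - (img_feat * text_feat).sum(dim=-1).mean()
    loss.backward()
    optimizer.step()
```

In the full method, the gradient would flow back through the volume rendering model into the avatar's geometry and texture, rather than into raw pixels as in this simplified sketch.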