OhMG: Zero-shot Open-vocabulary Human Motion Generation

by   Junfan Lin, et al.

Generating motion in line with text has attracted increasing attention nowadays. However, open-vocabulary human motion generation still remains touchless and undergoes the lack of diverse labeled data. The good news is that, recent studies of large multi-model foundation models (e.g., CLIP) have demonstrated superior performance on few/zero-shot image-text alignment, largely reducing the need for manually labeled data. In this paper, we take advantage of CLIP for open-vocabulary 3D human motion generation in a zero-shot manner. Specifically, our model is composed of two stages, i.e., text2pose and pose2motion. For text2pose, to address the difficulty of optimization with direct supervision from CLIP, we propose to carve the versatile CLIP model into a slimmer but more specific model for aligning 3D poses and texts, via a novel pipeline distillation strategy. Optimizing with the distilled 3D pose-text model, we manage to concretize the text-pose knowledge of CLIP into a text2pose generator effectively and efficiently. As for pose2motion, drawing inspiration from the advanced language model, we pretrain a transformer-based motion model, which makes up for the lack of motion dynamics of CLIP. After that, by formulating the generated poses from the text2pose stage as prompts, the motion generator can generate motions referring to the poses in a controllable and flexible manner. Our method is validated against advanced baselines and obtains sharp improvements. The code will be released here.


page 5

page 17


Text-Conditional Contextualized Avatars For Zero-Shot Personalization

Recent large-scale text-to-image generation models have made significant...

ZYN: Zero-Shot Reward Models with Yes-No Questions

In this work, we address the problem of directing the text generations o...

Vocabulary-informed Zero-shot and Open-set Learning

Despite significant progress in object categorization, in recent years, ...

Make-An-Animation: Large-Scale Text-conditional 3D Human Motion Generation

Text-guided human motion generation has drawn significant interest becau...

CLIP-Actor: Text-Driven Recommendation and Stylization for Animating Human Meshes

We propose CLIP-Actor, a text-driven motion recommendation and neural me...

Multi-modal Prompting for Low-Shot Temporal Action Localization

In this paper, we consider the problem of temporal action localization u...

Testing the Reliability of ChatGPT for Text Annotation and Classification: A Cautionary Remark

Recent studies have demonstrated promising potential of ChatGPT for vari...

Please sign up or login with your details

Forgot password? Click here to reset