DisCo: Disentangled Control for Referring Human Dance Generation in Real World

by Tan Wang, et al.
Nanyang Technological University

Generative AI has made significant strides in computer vision, particularly in image and video synthesis conditioned on text descriptions. Despite these advances, generating human-centric content such as dance remains challenging. Existing dance synthesis methods struggle to bridge the gap between synthesized content and real-world dance scenarios. In this paper, we define a new problem setting, Referring Human Dance Generation, which focuses on real-world dance scenarios and has three important properties: (i) Faithfulness: the synthesis should retain the appearance of both the human subject foreground and the background from the reference image, and precisely follow the target pose; (ii) Generalizability: the model should generalize to unseen human subjects, backgrounds, and poses; (iii) Compositionality: it should allow for composition of seen/unseen subjects, backgrounds, and poses from different sources. To address these challenges, we introduce a novel approach, DisCo, which combines a model architecture with disentangled control, to improve the faithfulness and compositionality of dance synthesis, with an effective human attribute pre-training, for better generalizability to unseen humans. Extensive qualitative and quantitative results demonstrate that DisCo can generate high-quality human dance images and videos with diverse appearances and flexible motions. Code, demo, video, and visualization are available at: https://disco-dance.github.io/.


Synthesizing Images of Humans in Unseen Poses

We address the computational problem of novel human pose synthesis. Give...

Text2Performer: Text-Driven Human Video Generation

Text-driven content creation has evolved to be a transformative techniqu...

Image Comes Dancing with Collaborative Parsing-Flow Video Synthesis

Transferring human motion from a source to a target person poses great p...

TIPS: Text-Induced Pose Synthesis

In computer vision, human pose synthesis and transfer deal with probabil...

Few-shot Video-to-Video Synthesis

Video-to-video synthesis (vid2vid) aims at converting an input semantic ...

Human Motion Transfer from Poses in the Wild

In this paper, we tackle the problem of human motion transfer, where we ...

Behavior-Driven Synthesis of Human Dynamics

Generating and representing human behavior are of major importance for v...
