VideoControlNet: A Motion-Guided Video-to-Video Translation Framework by Using Diffusion Model with ControlNet

by   Zhihao Hu, et al.

Recently, diffusion models like StableDiffusion have achieved impressive image generation results. However, the generation process of such diffusion models is uncontrollable, which makes it hard to generate videos with continuous and consistent content. In this work, by using the diffusion model with ControlNet, we proposed a new motion-guided video-to-video translation framework called VideoControlNet to generate various videos based on the given prompts and the condition from the input video. Inspired by the video codecs that use motion information for reducing temporal redundancy, our framework uses motion information to prevent the regeneration of the redundant areas for content consistency. Specifically, we generate the first frame (i.e., the I-frame) by using the diffusion model with ControlNet. Then we generate other key frames (i.e., the P-frame) based on the previous I/P-frame by using our newly proposed motion-guided P-frame generation (MgPG) method, in which the P-frames are generated based on the motion information and the occlusion areas are inpainted by using the diffusion model. Finally, the rest frames (i.e., the B-frame) are generated by using our motion-guided B-frame interpolation (MgBI) module. Our experiments demonstrate that our proposed VideoControlNet inherits the generation capability of the pre-trained large diffusion model and extends the image diffusion model to the video diffusion model by using motion information. More results are provided at our project page.


page 1

page 3

page 4

page 6

page 7

page 8

page 10

page 11


VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation

In this paper, we present VideoGen, a text-to-video generation approach,...

Dual-Stream Diffusion Net for Text-to-Video Generation

With the emerging diffusion models, recently, text-to-video generation h...

ControlVideo: Training-free Controllable Text-to-Video Generation

Text-driven diffusion models have unlocked unprecedented abilities in im...

Shortcut-V2V: Compression Framework for Video-to-Video Translation based on Temporal Redundancy Reduction

Video-to-video translation aims to generate video frames of a target dom...

APLA: Additional Perturbation for Latent Noise with Adversarial Training Enables Consistency

Diffusion models have exhibited promising progress in video generation. ...

Towards Smooth Video Composition

Video generation requires synthesizing consistent and persistent frames ...

MeDM: Mediating Image Diffusion Models for Video-to-Video Translation with Temporal Correspondence Guidance

This study introduces an efficient and effective method, MeDM, that util...

Please sign up or login with your details

Forgot password? Click here to reset