Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models

by Andreas Blattmann et al.

Latent Diffusion Models (LDMs) enable high-quality image synthesis while avoiding excessive compute demands by training a diffusion model in a compressed, lower-dimensional latent space. Here, we apply the LDM paradigm to high-resolution video generation, a particularly resource-intensive task. We first pre-train an LDM on images only; then, we turn the image generator into a video generator by introducing a temporal dimension to the latent-space diffusion model and fine-tuning on encoded image sequences, i.e., videos. Similarly, we temporally align diffusion-model upsamplers, turning them into temporally consistent video super-resolution models. We focus on two relevant real-world applications: simulation of in-the-wild driving data and creative content creation with text-to-video modeling. In particular, we validate our Video LDM on real driving videos of resolution 512 x 1024, achieving state-of-the-art performance. Furthermore, our approach can easily leverage off-the-shelf pre-trained image LDMs, as we only need to train a temporal alignment model in that case. Doing so, we turn the publicly available, state-of-the-art text-to-image LDM Stable Diffusion into an efficient and expressive text-to-video model with resolution up to 1280 x 2048. We show that the temporal layers trained in this way generalize to different fine-tuned text-to-image LDMs. Utilizing this property, we show the first results for personalized text-to-video generation, opening exciting directions for future content creation. Project page:
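The temporal-alignment idea described above, interleaving newly trained temporal layers with the frozen per-frame spatial layers of a pre-trained image LDM, can be sketched roughly as follows. This is a hedged illustration under stated assumptions, not the authors' code: the class name `TemporalAttention`, the head count, and the exact reshaping are illustrative choices. The learnable mixing weight `alpha` is initialized so the layer acts as an identity, leaving the pre-trained image model's behavior unchanged before video fine-tuning begins.

```python
import torch
import torch.nn as nn


class TemporalAttention(nn.Module):
    """Hypothetical temporal alignment layer (names and sizes are assumptions).

    Spatial layers of a pre-trained image LDM process each frame
    independently on tensors of shape (batch*frames, c, h, w); this layer
    mixes information across the frame axis at every spatial location.
    """

    def __init__(self, channels: int, num_frames: int, num_heads: int = 4):
        super().__init__()
        self.num_frames = num_frames
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Learnable mix, initialized to 1 so the block is an identity at the
        # start of fine-tuning and the image model's outputs are preserved.
        self.alpha = nn.Parameter(torch.ones(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bt, c, h, w = x.shape
        b = bt // self.num_frames
        # (b*t, c, h, w) -> (b*h*w, t, c): one time sequence per pixel.
        z = x.view(b, self.num_frames, c, h * w).permute(0, 3, 1, 2)
        z = z.reshape(b * h * w, self.num_frames, c)
        z = self.norm(z)
        out, _ = self.attn(z, z, z, need_weights=False)
        # Back to the per-frame layout the spatial layers expect.
        out = out.reshape(b, h * w, self.num_frames, c)
        out = out.permute(0, 2, 3, 1).reshape(bt, c, h, w)
        # Blend the per-frame (spatial) and cross-frame (temporal) paths.
        return self.alpha * x + (1.0 - self.alpha) * out
```

In this sketch only the temporal layers would be trained on encoded video clips, while the surrounding spatial layers stay frozen, which is why the same temporal weights can plug into differently fine-tuned image backbones.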


