Phenaki: Variable Length Video Generation From Open Domain Textual Description

by Ruben Villegas, et al.
University of Michigan

We present Phenaki, a model capable of realistic video synthesis given a sequence of textual prompts. Generating videos from text is particularly challenging due to the computational cost, the limited amount of high-quality text-video data, and the variable length of videos. To address these issues, we introduce a new model for learning video representations that compresses a video into a small sequence of discrete tokens. This tokenizer uses causal attention in time, which allows it to work with variable-length videos. To generate video tokens from text, we use a bidirectional masked transformer conditioned on pre-computed text tokens. The generated video tokens are subsequently de-tokenized to produce the actual video. To address the data issues, we demonstrate how joint training on a large corpus of image-text pairs together with a smaller number of video-text examples can yield generalization beyond what is available in the video datasets. Compared to previous video generation methods, Phenaki can generate arbitrarily long videos conditioned on a sequence of open-domain prompts (i.e., time-variable text, or a story). To the best of our knowledge, this is the first paper to study generating videos from time-variable prompts. In addition, compared to per-frame baselines, the proposed video encoder-decoder computes fewer tokens per video yet achieves better spatio-temporal consistency.
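The bidirectional masked transformer described above is typically sampled with iterative parallel decoding: start from a fully masked token sequence, predict all masked positions at once, commit the most confident predictions, and re-mask the rest according to a schedule. The following is a minimal sketch of that decoding loop, not Phenaki's actual implementation; the `predict` function is a hypothetical stand-in (using random scores) for the transformer's logits, and the codebook size of 1024 and cosine schedule are illustrative assumptions.

```python
import numpy as np

MASK = -1  # sentinel id for a masked (not-yet-generated) video token

def predict(tokens, rng):
    """Stand-in for the masked transformer: propose a token id and a
    confidence score for every position. The real model would condition
    on the pre-computed text tokens; here the scores are random."""
    n = len(tokens)
    proposals = rng.integers(0, 1024, size=n)  # assumed codebook of 1024 ids
    confidences = rng.random(n)
    return proposals, confidences

def iterative_decode(num_tokens, num_steps=8, seed=0):
    """Iterative parallel decoding: commit the most confident predictions
    each step, re-masking the rest per a cosine schedule."""
    rng = np.random.default_rng(seed)
    tokens = np.full(num_tokens, MASK, dtype=np.int64)
    for step in range(1, num_steps + 1):
        proposals, conf = predict(tokens, rng)
        # Fraction of tokens that remain masked after this step (0 at the end).
        n_keep_masked = int(num_tokens * np.cos(np.pi / 2 * step / num_steps))
        # Already-committed tokens are never revisited.
        conf = np.where(tokens == MASK, conf, np.inf)
        new_tokens = np.where(tokens == MASK, proposals, tokens)
        if n_keep_masked > 0:
            lowest = np.argsort(conf)[:n_keep_masked]  # least confident
            new_tokens[lowest] = MASK
        tokens = new_tokens
    return tokens
```

After the final step the schedule leaves zero positions masked, so the full token sequence is available for de-tokenization into frames. This parallel scheme is what makes generation far cheaper than decoding the thousands of video tokens one at a time autoregressively.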


Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning

Most methods for conditional video synthesis use a single modality as th...

GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions

Generating videos from text is a challenging task due to its high comput...

How can objects help action recognition?

Current state-of-the-art video models process a video clip as a long seq...

BIT: Bi-Level Temporal Modeling for Efficient Supervised Action Segmentation

We address the task of supervised action segmentation which aims to part...

LTC-GIF: Attracting More Clicks on Feature-length Sports Videos

This paper proposes a lightweight method to attract users and increase v...

Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer

Videos are created to express emotion, exchange information, and share e...

Variable Length Embeddings

In this work, we introduce a novel deep learning architecture, Variable ...