Long-horizon video prediction using a dynamic latent hierarchy

by Alexey Zakharov, et al.

Video prediction and generation is notoriously difficult, and research in this area has largely been limited to short-term predictions. Though plagued with noise and stochasticity, videos consist of features that are organised in a spatiotemporal hierarchy, with different features possessing different temporal dynamics. In this paper, we introduce Dynamic Latent Hierarchy (DLH), a deep hierarchical latent model that represents videos as a hierarchy of latent states that evolve over separate and fluid timescales. Each latent state is a mixture distribution with two components, representing the immediate past and the predicted future. This causes the model to learn transitions only between sufficiently dissimilar states, while clustering temporally persistent states closer together. Using this property, DLH naturally discovers the spatiotemporal structure of a dataset and learns disentangled representations across its hierarchy. We hypothesise that this simplifies the task of modelling the temporal dynamics of a video, improves the learning of long-term dependencies, and reduces error accumulation. As evidence, we demonstrate that DLH outperforms state-of-the-art benchmarks in video prediction, represents stochasticity better, and dynamically adjusts its hierarchical and temporal structure. Our paper shows, among other things, how progress in representation learning can translate into progress in prediction tasks.
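To make the two-component latent state described in the abstract more concrete, here is a minimal sketch. It is not the authors' implementation: the 1-D Gaussian components, the fixed mixing weight `weight_predicted`, and the class names `Gaussian` and `TwoComponentState` are all assumptions chosen for illustration, since the abstract does not specify the form of the components or how the weights are obtained.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Gaussian:
    """1-D Gaussian; the choice of distribution is an assumption."""
    mean: float
    std: float

    def pdf(self, x: float) -> float:
        z = (x - self.mean) / self.std
        return math.exp(-0.5 * z * z) / (self.std * math.sqrt(2.0 * math.pi))


@dataclass
class TwoComponentState:
    """Hypothetical latent state: a mixture of a 'past' component and a
    'predicted' component, with a fixed mixing weight (an assumption)."""
    past: Gaussian
    predicted: Gaussian
    weight_predicted: float  # probability mass on the predicted component

    def pdf(self, x: float) -> float:
        w = self.weight_predicted
        return (1.0 - w) * self.past.pdf(x) + w * self.predicted.pdf(x)

    def sample(self, rng: random.Random) -> float:
        # Pick a component first, then draw from it.
        use_predicted = rng.random() < self.weight_predicted
        comp = self.predicted if use_predicted else self.past
        return rng.gauss(comp.mean, comp.std)


if __name__ == "__main__":
    rng = random.Random(0)
    # Dissimilar components: the mixture is bimodal, so a draw either stays
    # near the past state or moves towards the predicted one.
    state = TwoComponentState(Gaussian(0.0, 0.5), Gaussian(4.0, 0.5), 0.5)
    print([round(state.sample(rng), 2) for _ in range(5)])
```

When the two components coincide, the mixture collapses to a single Gaussian regardless of the weight. That is one informal way to read the abstract's claim that transitions are only learned between sufficiently dissimilar states.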



Variational Predictive Routing with Nested Subjective Timescales

Discovery and learning of an underlying spatiotemporal hierarchy in sequ...

Temporally Consistent Video Transformer for Long-Term Video Prediction

Generating long, temporally consistent video remains an open challenge i...

A Neurally-Inspired Hierarchical Prediction Network for Spatiotemporal Sequence Learning and Prediction

In this paper we developed a hierarchical network model, called Hierarch...

Learning Representations for Control with Hierarchical Forward Models

Learning control from pixels is difficult for reinforcement learning (RL...

Multi-axis Attentive Prediction for Sparse Event Data: An Application to Crime Prediction

Spatiotemporal prediction of event data is a challenging task with a lon...

Clockwork Variational Autoencoders

Deep learning has enabled algorithms to generate realistic images. Howev...

Physics-informed Tensor-train ConvLSTM for Volumetric Velocity Forecasting

According to the National Academies, a weekly forecast of velocity, vert...
