Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics

by Jiangliu Wang et al.

This paper proposes a novel pretext task to address the self-supervised video representation learning problem. Specifically, given an unlabeled video clip, we compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion, and the spatial location and dominant color of the region with the largest color diversity along the temporal axis. A neural network is then built and trained to predict these statistical summaries given the video frames as input. To alleviate the learning difficulty, we employ several spatial partitioning patterns that encode rough spatial locations instead of exact spatial Cartesian coordinates. Our approach is inspired by the observation that the human visual system is sensitive to rapidly changing contents in the visual field and needs only impressions of rough spatial locations to understand visual content. To validate the effectiveness of the proposed approach, we conduct extensive experiments with several 3D backbone networks, i.e., C3D, 3D-ResNet, and R(2+1)D. The results show that our approach outperforms existing approaches across the three backbone networks on various downstream video analysis tasks, including action recognition, video retrieval, dynamic scene recognition, and action similarity labeling. The source code is made publicly available at:
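To make the pretext task concrete, the sketch below shows one way to derive such motion-statistics labels from a clip: partition each frame into a coarse grid, find the block with the largest accumulated motion, and quantize the dominant motion direction into angular bins. This is a minimal NumPy illustration, not the authors' implementation: it uses simple frame differencing as a stand-in for optical flow, and the grid size and bin count are assumptions for illustration.

```python
import numpy as np

def motion_statistics(clip, grid=4, n_direction_bins=8):
    """Sketch of motion-statistics labels for the pretext task:
    which spatial block contains the largest motion, and the
    dominant direction quantized into angular bins.

    clip: array of shape (T, H, W), grayscale frames in [0, 1].
    Returns (block_index, direction_bin).
    Note: frame differencing is a crude proxy for the optical
    flow a real implementation would use.
    """
    T, H, W = clip.shape

    # Accumulated temporal change as a motion-magnitude proxy.
    diff = np.abs(np.diff(clip, axis=0)).sum(axis=0)  # (H, W)

    # Partition the frame into a grid x grid pattern of blocks
    # (a rough spatial encoding, not exact coordinates) and pick
    # the block with the largest accumulated motion.
    bh, bw = H // grid, W // grid
    block_motion = (diff[:bh * grid, :bw * grid]
                    .reshape(grid, bh, grid, bw)
                    .sum(axis=(1, 3)))
    block_index = int(np.argmax(block_motion))  # label in [0, grid*grid)

    # Dominant direction: spatial gradients of the motion map,
    # weighted by motion magnitude, quantized into direction bins.
    gy, gx = np.gradient(diff)
    angles = np.arctan2(gy, gx)  # in [-pi, pi]
    bins = ((angles + np.pi) / (2 * np.pi) * n_direction_bins).astype(int)
    bins %= n_direction_bins
    counts = np.bincount(bins.ravel(), weights=diff.ravel(),
                         minlength=n_direction_bins)
    direction_bin = int(counts.argmax())
    return block_index, direction_bin
```

A network trained on the raw frames would then regress or classify these `(block_index, direction_bin)` labels; analogous statistics over color channels would give the color-diversity targets described in the abstract.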



Related research

Self-supervised Spatio-temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics

We address the problem of video representation learning without human-an...

Self-supervised Video Representation Learning by Pace Prediction

This paper addresses the problem of self-supervised video representation...

Self-supervised Temporal Discriminative Learning for Video Representation Learning

Temporal cues in videos provide important information for recognizing ac...

Self-Supervised Video Representation Learning with Space-Time Cubic Puzzles

Self-supervised tasks such as colorization, inpainting and jigsaw puzzle...

Self-Supervised Video Representation Learning by Video Incoherence Detection

This paper introduces a novel self-supervised method that leverages inco...

Contextual Explainable Video Representation: Human Perception-based Understanding

Video understanding is a growing field and a subject of intense research...

FactorMatte: Redefining Video Matting for Re-Composition Tasks

We propose "factor matting", an alternative formulation of the video mat...
