Self-supervised Spatiotemporal Representation Learning by Exploiting Video Continuity

12/11/2021
by Hanwen Liang, et al.

Recent self-supervised video representation learning methods have found significant success by exploring essential properties of videos, such as speed and temporal order. This work exploits an essential yet under-explored property of videos, video continuity, to obtain supervision signals for self-supervised representation learning. Specifically, we formulate three novel continuity-related pretext tasks, i.e. continuity justification, discontinuity localization, and missing section approximation, that jointly supervise a shared backbone for video representation learning. This self-supervision approach, termed the Continuity Perception Network (CPNet), solves the three tasks together and encourages the backbone network to learn local and long-range motion and context representations. It outperforms prior art on multiple downstream tasks, such as action recognition, video retrieval, and action localization. Additionally, video continuity can be complementary to other coarse-grained video properties for representation learning, and integrating the proposed pretext task into prior methods can yield notable performance gains.
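The abstract does not say how training samples for the three pretext tasks are constructed, so the following is only a minimal sketch under stated assumptions: a discontinuous clip is formed by dropping a contiguous run of frames from the middle of a longer span, which would give one label per task (continuous or not, where the cut falls, and which frames are missing). The function name `sample_clip_indices` and its parameters are hypothetical and are not taken from the paper.

```python
import random


def sample_clip_indices(num_frames, clip_len, gap_len, discontinuous, rng=random):
    """Pick frame indices for one training clip (illustrative only).

    Returns (indices, cut, missing):
      indices - clip_len frame indices fed to the backbone
      cut     - position inside the clip where the break occurs, or None
      missing - indices of the dropped frames (empty for a continuous clip)
    """
    if not discontinuous:
        if clip_len > num_frames:
            raise ValueError("video is shorter than the requested clip")
        start = rng.randrange(num_frames - clip_len + 1)
        return list(range(start, start + clip_len)), None, []

    span = clip_len + gap_len  # frames covered before the gap is removed
    if span > num_frames:
        raise ValueError("video is shorter than clip plus gap")
    start = rng.randrange(num_frames - span + 1)
    # Assumption: the break never sits at the clip boundary, so both
    # sides of the cut contain at least one frame.
    cut = rng.randrange(1, clip_len)
    head = list(range(start, start + cut))
    missing = list(range(start + cut, start + cut + gap_len))
    tail = list(range(start + cut + gap_len, start + span))
    return head + tail, cut, missing


# Usage: one continuous and one discontinuous sample from a 64-frame video.
rng = random.Random(0)
cont, _, _ = sample_clip_indices(64, 16, 8, discontinuous=False, rng=rng)
disc, cut, missing = sample_clip_indices(64, 16, 8, discontinuous=True, rng=rng)
```

In this sketch, `discontinuous` would serve as the target for continuity justification, `cut` for discontinuity localization, and the frames at `missing` as the content whose features the network would be asked to approximate. How CPNet actually defines these targets is not stated in the abstract.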

research
07/08/2021

Video 3D Sampling for Self-supervised Representation Learning

Most of the existing video self-supervised methods mainly leverage tempo...
research
11/16/2022

Boosting Object Representation Learning via Motion and Object Continuity

Recent unsupervised multi-object detection models have shown impressive ...
research
10/28/2019

Skip-Clip: Self-Supervised Spatiotemporal Representation Learning by Future Clip Order Ranking

Deep neural networks require collecting and annotating large amounts of ...
research
09/26/2021

Self-Supervised Video Representation Learning by Video Incoherence Detection

This paper introduces a novel self-supervised method that leverages inco...
research
09/24/2022

Self-supervised Learning for Unintentional Action Prediction

Distinguishing if an action is performed as intended or if an intended a...
research
07/29/2020

Learning Video Representations from Textual Web Supervision

Videos found on the Internet are paired with pieces of text, such as tit...
research
01/02/2023

STEPs: Self-Supervised Key Step Extraction from Unlabeled Procedural Videos

We address the problem of extracting key steps from unlabeled procedural...
