Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning

03/18/2021
by   Mandela Patrick, et al.

The quality of the image representations obtained from self-supervised learning depends strongly on the type of data augmentations used in the learning formulation. Recent papers have ported these methods from still images to videos and found that leveraging both audio and video signals yields strong gains; however, they did not find that spatial augmentations such as cropping, which are very important for still images, work as well for videos. In this paper, we improve these formulations in two ways unique to the spatio-temporal aspect of videos. First, for space, we show that spatial augmentations such as cropping do work well for videos too, but that previous implementations, due to the high processing and memory cost, could not apply them at a scale sufficient for them to work well. To address this issue, we introduce Feature Crop, a method to simulate such augmentations much more efficiently directly in feature space. Second, we show that, as opposed to naive average pooling, the use of transformer-based attention improves performance significantly, and is well suited for processing feature crops. Combining both of our discoveries into a new method, Space-Time Crop & Attend (STiCA), we achieve state-of-the-art performance across multiple video-representation learning benchmarks. In particular, we achieve new state-of-the-art accuracies of 67.0% on HMDB-51 and 93.1% on UCF-101.
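The two ideas in the abstract can be illustrated with a minimal NumPy sketch: cropping a backbone's feature map instead of the raw video, then pooling the cropped tokens with a single learned attention query instead of averaging. This is an illustrative reconstruction, not the authors' implementation; the function names `feature_crop` and `attention_pool` and the single-query attention form are assumptions for the example.

```python
import numpy as np

def feature_crop(feats, crop_h, crop_w, rng):
    """Simulate a spatial crop directly on a feature map.

    feats: array of shape (T, H, W, C) from a video backbone.
    Instead of decoding and re-encoding a cropped video clip,
    slice a (crop_h, crop_w) window out of the feature grid,
    which is far cheaper than pixel-space cropping.
    """
    T, H, W, C = feats.shape
    y = rng.integers(0, H - crop_h + 1)  # random top-left corner
    x = rng.integers(0, W - crop_w + 1)
    return feats[:, y:y + crop_h, x:x + crop_w, :]

def attention_pool(feats, q, Wk, Wv):
    """Pool space-time feature tokens with one learned query,
    rather than naive average pooling.

    feats: (..., C) tokens (e.g. a feature crop, flattened here);
    q: (C,) learned query; Wk, Wv: (C, C) key/value projections.
    """
    tokens = feats.reshape(-1, feats.shape[-1])       # (N, C)
    keys = tokens @ Wk                                # (N, C)
    values = tokens @ Wv                              # (N, C)
    scores = keys @ q / np.sqrt(q.shape[0])           # (N,) scaled dot-product
    weights = np.exp(scores - scores.max())           # stable softmax
    weights /= weights.sum()
    return weights @ values                           # (C,) weighted pooling
```

The key point is that `feature_crop` is a pure slicing operation on an already-computed tensor, so many crops per clip cost almost nothing, while `attention_pool` lets the model weight informative space-time locations instead of treating them uniformly.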


