SSAN: Separable Self-Attention Network for Video Representation Learning

05/27/2021
by Xudong Guo, et al.

Self-attention has been successfully applied to video representation learning because of its effectiveness in modeling long-range dependencies. Existing approaches build these dependencies merely by computing pairwise correlations along the spatial and temporal dimensions simultaneously. However, spatial correlations and temporal correlations capture different kinds of contextual information: scene context and temporal reasoning, respectively. Intuitively, learning spatial contextual information first should benefit temporal modeling. In this paper, we propose a separable self-attention (SSA) module that models spatial and temporal correlations sequentially, so that spatial contexts can be used efficiently in temporal modeling. By adding the SSA module to a 2D CNN, we build an SSA network (SSAN) for video representation learning. On the task of video action recognition, our approach outperforms state-of-the-art methods on the Something-Something and Kinetics-400 datasets; our models often outperform counterparts despite using shallower networks and fewer modalities. We further verify the semantic learning ability of our method on the vision-language task of video retrieval, which showcases the homogeneity of video representations and text embeddings. On the MSR-VTT and YouCook2 datasets, video representations learned by SSA significantly improve on the state-of-the-art performance.
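The core idea of separable self-attention, as described above, is to factor joint spatio-temporal attention into a spatial step followed by a temporal step. A minimal NumPy sketch of that factorization is shown below; the tensor layout `(T, N, C)` (frames, spatial positions, channels) and the plain scaled dot-product attention used here are illustrative assumptions, not the paper's exact SSA formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention over the second-to-last axis
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def separable_self_attention(x):
    """x: (T, N, C) video features.

    Sketch of the separable idea: attend over the N spatial
    positions within each frame first, then over the T frames at
    each spatial position (the paper's SSA details may differ,
    e.g. learned projections and residual connections).
    """
    # spatial step: pairwise correlations within each frame
    x = attention(x, x, x)            # (T, N, C)
    # temporal step: pairwise correlations across frames
    xt = x.transpose(1, 0, 2)         # (N, T, C)
    xt = attention(xt, xt, xt)        # (N, T, C)
    return xt.transpose(1, 0, 2)      # back to (T, N, C)
```

Because the two steps run sequentially, the spatial context computed in the first step is already folded into the features before temporal correlations are computed, which is the intuition the abstract argues for.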


