ActionVLAD: Learning spatio-temporal aggregation for action classification

04/10/2017
by Rohit Girdhar, et al.

In this work, we introduce a new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of the video. We do so by integrating state-of-the-art two-stream networks with learnable spatio-temporal feature aggregation. The resulting architecture is end-to-end trainable for whole-video classification. We investigate different strategies for pooling across space and time and for combining signals from the different streams. We find that (i) it is important to pool jointly across space and time, but (ii) appearance and motion streams are best aggregated into their own separate representations. Finally, we show that our representation outperforms the two-stream base architecture by a large margin (13% relative) and also outperforms other baselines with comparable base architectures on the HMDB51, UCF101, and Charades video classification benchmarks.
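
As a rough illustration of what pooling local features jointly across space and time can look like, here is a minimal sketch of VLAD-style aggregation with soft assignment. It is not the authors' implementation: the function names (`soft_assign`, `l2_normalize`, `vlad_pool`), the `alpha` sharpness parameter, the normalization steps, and the plain-list data layout are all assumptions made for illustration, and in a trainable network the centers would be learned rather than fixed.

```python
import math


def soft_assign(x, centers, alpha=1.0):
    """Softmax over negative scaled squared distances from x to each center."""
    logits = [-alpha * sum((xj - cj) ** 2 for xj, cj in zip(x, c)) for c in centers]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / (norm + eps) for a in v]


def vlad_pool(video, centers, alpha=1.0):
    """Aggregate every local descriptor of a video into one fixed-length vector.

    video:   list of frames; each frame is a list of D-dimensional descriptors
             (one per spatial location). This layout is an assumption.
    centers: K anchor points, each D-dimensional.
    """
    k, d = len(centers), len(centers[0])
    v = [[0.0] * d for _ in range(k)]
    # A single sum over all frames and all locations: space and time are pooled jointly.
    for frame in video:
        for x in frame:
            w = soft_assign(x, centers, alpha)
            for ki in range(k):
                for j in range(d):
                    v[ki][j] += w[ki] * (x[j] - centers[ki][j])
    v = [l2_normalize(row) for row in v]   # per-center normalization
    flat = [a for row in v for a in row]   # flatten K x D into one vector
    return l2_normalize(flat)


# Toy input: 2 frames, 2-D descriptors, 2 centers (all values are made up).
video = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]]]
centers = [[0.0, 0.0], [1.0, 1.0]]
descriptor = vlad_pool(video, centers)
```

Under this sketch, keeping appearance and motion in separate representations would mean calling `vlad_pool` once per stream, each with its own centers, and combining the two outputs afterwards.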

research
02/14/2021

Learning Self-Similarity in Space and Time as Generalized Motion for Action Recognition

Spatio-temporal convolution often fails to learn motion dynamics in vide...
research
09/29/2022

4D-StOP: Panoptic Segmentation of 4D LiDAR using Spatio-temporal Object Proposal Generation and Aggregation

In this work, we present a new paradigm, called 4D-StOP, to tackle the t...
research
02/11/2020

Learning spatio-temporal representations with temporal squeeze pooling

In this paper, we propose a new video representation learning method, na...
research
08/24/2017

Relaxed Spatio-Temporal Deep Feature Aggregation for Real-Fake Expression Prediction

Frame-level visual features are generally aggregated in time with the te...
research
04/26/2022

Stochastic Coherence Over Attention Trajectory For Continuous Learning In Video Streams

Devising intelligent agents able to live in an environment and learn by ...
research
05/30/2019

AssembleNet: Searching for Multi-Stream Neural Connectivity in Video Architectures

Learning to represent videos is a very challenging task both algorithmic...
research
01/18/2021

Non-parametric Memory for Spatio-Temporal Segmentation of Construction Zones for Self-Driving

In this paper, we introduce a non-parametric memory representation for s...