Spatio-Temporal FAST 3D Convolutions for Human Action Recognition

09/30/2019
by Alexandros Stergiou, et al.

Effective processing of video input is essential for the recognition of temporally varying events such as human actions. Motivated by the often distinctive temporal characteristics of actions in either the horizontal or the vertical direction, we introduce a novel convolution block for CNN architectures with video input. Our proposed Fractioned Adjacent Spatial and Temporal (FAST) 3D convolutions are a natural decomposition of a regular 3D convolution. Each convolution block consists of three sequential convolution operations: a 2D spatial convolution followed by spatio-temporal convolutions in the horizontal and vertical directions, respectively. Additionally, we introduce a FAST variant that treats horizontal and vertical motion in parallel. Experiments on the benchmark action recognition datasets UCF-101 and HMDB-51 with ResNet architectures demonstrate consistently higher performance of FAST 3D convolution blocks over traditional 3D convolutions. The lower validation loss indicates better generalization, especially for deeper networks. We also evaluate the performance of CNN architectures with similar memory requirements, based either on Two-stream networks or on 3D convolution blocks. DenseNet-121 with FAST 3D convolutions was shown to perform best, giving further evidence of the merits of the decoupled spatio-temporal convolutions.
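To make the decomposition described above concrete, the following is a minimal sketch that compares one regular 3D convolution with the three sequential convolutions of a FAST block. The kernel shapes (1×k×k for the spatial step, k×1×k for the horizontal step, k×k×1 for the vertical step, in time × height × width order), the equal input and output channel count, and the helper names are assumptions made for illustration; the abstract does not state them.

```python
from math import prod


def fast_kernel_shapes(k):
    """Assumed (time, height, width) kernel shapes for one sequential FAST block."""
    return [
        (1, k, k),  # 2D spatial convolution: height x width
        (k, 1, k),  # horizontal spatio-temporal convolution: time x width
        (k, k, 1),  # vertical spatio-temporal convolution: time x height
    ]


def weight_count(shapes, channels):
    """Number of weights (biases ignored) if every convolution maps channels -> channels."""
    return sum(prod(shape) * channels * channels for shape in shapes)


def sequential_receptive_field(shapes):
    """Receptive field per axis of stride-1, undilated convolutions applied in sequence."""
    return tuple(1 + sum(shape[axis] - 1 for shape in shapes) for axis in range(3))


if __name__ == "__main__":
    for k in (3, 5):
        full = weight_count([(k, k, k)], channels=64)
        fast = weight_count(fast_kernel_shapes(k), channels=64)
        field = sequential_receptive_field(fast_kernel_shapes(k))
        print(f"k={k}: regular 3D weights={full}, FAST weights={fast}, FAST receptive field={field}")
```

Under these assumptions, the three factored kernels hold 3k² weights per channel pair against k³ for the full kernel, so the counts coincide at k = 3 and the factored block is smaller for larger k. Stacked in sequence, the three convolutions cover 2k − 1 positions along each of the time, height, and width axes.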

research
03/18/2020

STH: Spatio-Temporal Hybrid Convolution for Efficient Action Recognition

Effective and Efficient spatio-temporal modeling is essential for action...
research
11/08/2020

Right on Time: Multi-Temporal Convolutions for Human Action Recognition in Videos

The variations in the temporal performance of human actions observed in ...
research
10/12/2021

TAda! Temporally-Adaptive Convolutions for Video Understanding

Spatial convolutions are widely used in numerous deep video models. It f...
research
07/06/2016

VideoLSTM Convolves, Attends and Flows for Action Recognition

We present a new architecture for end-to-end sequence learning of action...
research
09/18/2019

Class Feature Pyramids for Video Explanation

Deep convolutional networks are widely used in video action recognition....
research
09/07/2019

Exploring Temporal Differences in 3D Convolutional Neural Networks

Traditional 3D convolutions are computationally expensive, memory intens...
research
10/20/2021

GTM: Gray Temporal Model for Video Recognition

Data input modality plays an important role in video action recognition....
