Multi-stage Factorized Spatio-Temporal Representation for RGB-D Action and Gesture Recognition

by Yujun Ma, et al.

RGB-D action and gesture recognition remain an interesting topic in human-centered scene understanding, primarily due to the multiple granularities and large variation in human motion. Although many RGB-D based action and gesture recognition approaches have demonstrated remarkable results by utilizing highly integrated spatio-temporal representations across multiple modalities (i.e., RGB and depth data), they still encounter several challenges. First, vanilla 3D convolution struggles to capture fine-grained motion differences between local clips under different modalities. Second, the intricate nature of highly integrated spatio-temporal modeling can lead to optimization difficulties. Third, duplicate and unnecessary information adds complexity and further complicates entangled spatio-temporal modeling. To address these issues, we propose an innovative heuristic architecture called Multi-stage Factorized Spatio-Temporal (MFST) for RGB-D action and gesture recognition. The proposed MFST model comprises a 3D Central Difference Convolution Stem (CDC-Stem) module and multiple factorized spatio-temporal stages. The CDC-Stem enriches fine-grained temporal perception, and the multiple hierarchical spatio-temporal stages construct dimension-independent higher-order semantic primitives. Specifically, the CDC-Stem module captures bottom-level spatio-temporal features and passes them successively to the following factorized spatio-temporal stages, which capture hierarchical spatial and temporal features through the Multi-Scale Convolution and Transformer (MSC-Trans) hybrid block and the Weight-shared Multi-Scale Transformer (WMS-Trans) block. The seamless integration of these innovative designs results in a robust spatio-temporal representation that outperforms state-of-the-art approaches on RGB-D action and gesture recognition datasets.
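The abstract does not spell out how the CDC-Stem is computed, so the following is only a minimal sketch of the general central difference convolution idea, reduced to a single temporal axis in plain Python. It assumes the commonly used formulation in which a vanilla convolution response is combined with a term that subtracts the centre sample scaled by the kernel's weight sum. The function name `central_difference_conv1d`, the parameter `theta`, and its default value are hypothetical choices for illustration, not taken from the paper, and the actual 3D module may differ.

```python
def central_difference_conv1d(signal, weights, theta=0.7):
    """Sketch of a 1D central difference convolution (valid mode, stride 1).

    Assumed formulation (not confirmed by the abstract):
        out[t] = sum_k w[k] * x[t + k]  -  theta * x[t + centre] * sum_k w[k]
    theta = 0 reduces to a vanilla convolution; theta = 1 responds only to
    differences between each sample and the centre of its window.
    """
    k = len(weights)
    if k % 2 == 0:
        raise ValueError("kernel length must be odd so a centre tap exists")
    centre = k // 2
    weight_sum = sum(weights)

    out = []
    for t in range(len(signal) - k + 1):
        window = signal[t:t + k]
        # Vanilla (intensity-level) response of the kernel on this window.
        vanilla = sum(w * x for w, x in zip(weights, window))
        # Central difference term: remove the contribution of the centre value.
        out.append(vanilla - theta * window[centre] * weight_sum)
    return out


# A static signal carries no motion, so the pure-difference response vanishes.
static = central_difference_conv1d([2.0] * 5, [1.0, 2.0, 1.0], theta=1.0)
# A changing signal yields a non-zero response.
moving = central_difference_conv1d([1, 2, 4, 7], [1, 1, 1], theta=1)
```

With `theta=1` the static input produces an all-zero output while the changing input does not, which illustrates why a difference-based stem can be more sensitive to fine-grained temporal change than a vanilla convolution.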




Learning Multi-Granular Spatio-Temporal Graph Network for Skeleton-based Action Recognition

The task of skeleton-based action recognition remains a core challenge i...

Dynamic Spatio-Temporal Specialization Learning for Fine-Grained Action Recognition

The goal of fine-grained action recognition is to successfully discrimin...

VPN: Learning Video-Pose Embedding for Activities of Daily Living

In this paper, we focus on the spatio-temporal aspect of recognizing Act...

A Spatio-Temporal Multilayer Perceptron for Gesture Recognition

Gesture recognition is essential for the interaction of autonomous vehic...

Spatio-Temporal Covariance Descriptors for Action and Gesture Recognition

We propose a new action and gesture recognition method based on spatio-t...

Regional Attention with Architecture-Rebuilt 3D Network for RGB-D Gesture Recognition

Human gesture recognition has drawn much attention in the area of comput...

Multi-Grained Spatio-temporal Modeling for Lip-reading

Lip-reading aims to recognize speech content from videos via visual anal...
