VPN: Learning Video-Pose Embedding for Activities of Daily Living

by   Srijan Das, et al.

In this paper, we focus on the spatio-temporal aspect of recognizing Activities of Daily Living (ADL). ADL have two specific properties (i) subtle spatio-temporal patterns and (ii) similar visual patterns varying with time. Therefore, ADL may look very similar and often necessitate to look at their fine-grained details to distinguish them. Because the recent spatio-temporal 3D ConvNets are too rigid to capture the subtle visual patterns across an action, we propose a novel Video-Pose Network: VPN. The 2 key components of this VPN are a spatial embedding and an attention network. The spatial embedding projects the 3D poses and RGB cues in a common semantic space. This enables the action recognition framework to learn better spatio-temporal features exploiting both modalities. In order to discriminate similar actions, the attention network provides two functionalities - (i) an end-to-end learnable pose backbone exploiting the topology of human body, and (ii) a coupler to provide joint spatio-temporal attention weights across a video. Experiments show that VPN outperforms the state-of-the-art results for action classification on a large scale human activity dataset: NTU-RGB+D 120, its subset NTU-RGB+D 60, a real-world challenging human activity dataset: Toyota Smarthome and a small scale human-object interaction dataset Northwestern UCLA.


page 2

page 13

page 24

page 25


VPN++: Rethinking Video-Pose embeddings for understanding Activities of Daily Living

Many attempts have been made towards combining RGB and 3D poses for the ...

Pose-conditioned Spatio-Temporal Attention for Human Action Recognition

We address human action recognition from multi-modal video data involvin...

Coarse Temporal Attention Network (CTA-Net) for Driver's Activity Recognition

There is significant progress in recognizing traditional human activitie...

A Nonparametric Model for Multimodal Collaborative Activities Summarization

Ego-centric data streams provide a unique opportunity to reason about jo...

JRDB-Act: A Large-scale Multi-modal Dataset for Spatio-temporal Action, Social Group and Activity Detection

The availability of large-scale video action understanding datasets has ...

Proactive Robot Assistance via Spatio-Temporal Object Modeling

Proactive robot assistance enables a robot to anticipate and provide for...

Multi-stage Factorized Spatio-Temporal Representation for RGB-D Action and Gesture Recognition

RGB-D action and gesture recognition remain an interesting topic in huma...

Please sign up or login with your details

Forgot password? Click here to reset