Temporal Transformer Networks with Self-Supervision for Action Recognition

12/14/2021
by Yongkang Zhang, et al.

In recent years, video action recognition based on 2D convolutional networks has gained wide popularity. However, the performance of existing models is seriously undercut by the lack of long-range non-linear temporal relation modeling and of reverse motion information modeling. To address this problem, we introduce a novel Temporal Transformer Network with Self-supervision (TTSN). TTSN mainly consists of a temporal transformer module and a temporal sequence self-supervision module. The temporal transformer module models the non-linear temporal dependencies among non-local frames, which significantly enhances the representation of complex motion features. The temporal sequence self-supervision module adopts a "random batch, random channel" strategy to reverse the order of video frames, enabling robust extraction of motion representations from the inverted temporal dimension and improving the generalization capability of the model. Extensive experiments on three widely used datasets (HMDB51, UCF101, and Something-Something V1) demonstrate that the proposed TTSN achieves state-of-the-art performance for action recognition.
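The "random batch, random channel" reversal described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the function name, the selection probability `p`, and the per-clip/per-channel sampling scheme are assumptions made for illustration: for each clip in the batch, with some probability, the frame order of a random subset of channels is reversed along the temporal axis.

```python
import numpy as np

def reverse_random_batch_channel(frames, p=0.5, rng=None):
    """Hypothetical sketch of a "random batch, random channel" temporal
    reversal: for randomly selected clips in the batch, reverse the frame
    order of randomly selected channels.

    frames: array of shape (B, T, C, H, W) -- batch, time, channel, height, width
    p: probability of selecting a clip, and of selecting each channel
    """
    rng = np.random.default_rng() if rng is None else rng
    out = frames.copy()
    B, T, C = frames.shape[:3]
    for b in range(B):
        if rng.random() < p:                               # this clip selected
            for c in np.flatnonzero(rng.random(C) < p):    # selected channels
                out[b, :, c] = frames[b, ::-1, c]          # reverse time axis
    return out
```

A self-supervised head could then be trained to predict, per clip, whether its frame sequence was reversed, which is one common way such a reversal pretext task is posed.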

Related research

- SVFormer: Semi-supervised Video Transformer for Action Recognition (11/23/2022)
- DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition (03/19/2022)
- Optimizing ViViT Training: Time and Memory Reduction for Action Recognition (06/07/2023)
- Spatio-Temporal Tuples Transformer for Skeleton-Based Action Recognition (01/08/2022)
- Knowing What, Where and When to Look: Efficient Video Action Modeling with Attention (04/02/2020)
- Higher Order Recurrent Space-Time Transformer (04/17/2021)
- Cross-Modality Time-Variant Relation Learning for Generating Dynamic Scene Graphs (05/15/2023)
