Low-Fidelity End-to-End Video Encoder Pre-training for Temporal Action Localization

03/28/2021
by   Mengmeng Xu, et al.

Temporal action localization (TAL) is a fundamental yet challenging task in video understanding. Existing TAL methods rely on pre-training a video encoder with action classification supervision. This creates a task discrepancy problem for the video encoder – trained for action classification, but used for TAL. Intuitively, end-to-end model optimization is a good solution. However, it is impractical for TAL under GPU memory constraints, due to the prohibitive computational cost of processing long untrimmed videos. In this paper, we resolve this challenge by introducing a novel low-fidelity end-to-end (LoFi) video encoder pre-training method. Instead of always using the full training configuration for TAL learning, we propose to reduce the mini-batch composition in terms of temporal, spatial, or spatio-temporal resolution, so that end-to-end optimization of the video encoder becomes feasible under the memory constraints of a mid-range hardware budget. Crucially, this enables the gradient to flow backward through the video encoder from a TAL loss supervision, favourably solving the task discrepancy problem and providing more effective feature representations. Extensive experiments show that the proposed LoFi pre-training approach can significantly enhance the performance of existing TAL methods. Encouragingly, even with a lightweight ResNet18-based video encoder in a single RGB stream, our method surpasses two-stream ResNet50-based alternatives with expensive optical flow, often by a good margin.
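The core idea – shrinking the temporal and/or spatial resolution of each mini-batch so that end-to-end optimization fits in memory – can be sketched as follows. This is an illustrative sketch, not the paper's actual code: the configuration names, frame counts, and the voxel-count memory proxy are all assumptions.

```python
# Hypothetical sketch of LoFi-style mini-batch configurations: reduce the
# temporal and/or spatial resolution of input clips so that gradients can
# flow end-to-end through the video encoder within a fixed memory budget.
# All names and numbers here are illustrative, not from the paper.

def clip_cost(num_frames, height, width):
    """Crude proxy for encoder activation memory: voxels per clip."""
    return num_frames * height * width

def lofi_configs(num_frames, height, width, factor=2):
    """Full-fidelity configuration plus three reduced ("low-fidelity") ones."""
    return {
        "full":            (num_frames,           height,           width),
        "temporal":        (num_frames // factor, height,           width),
        "spatial":         (num_frames,           height // factor, width // factor),
        "spatio-temporal": (num_frames // factor, height // factor, width // factor),
    }

if __name__ == "__main__":
    full = clip_cost(96, 224, 224)
    for name, (t, h, w) in lofi_configs(96, 224, 224).items():
        rel = clip_cost(t, h, w) / full
        print(f"{name:16s} T={t:3d}  HxW={h}x{w}  relative cost={rel:.3f}")
```

Halving both spatial axes cuts the proxy cost to a quarter, and combining it with temporal halving cuts it to an eighth, which is why a spatio-temporal reduction is the most memory-friendly of the three.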

