Global-to-Local Modeling for Video-based 3D Human Pose and Shape Estimation

by Xiaolong Shen, et al.

Video-based 3D human pose and shape estimation is evaluated by intra-frame accuracy and inter-frame smoothness. Although these two metrics depend on different ranges of temporal consistency, existing state-of-the-art methods treat them as a unified problem and build their networks from a single, uniform modeling structure (e.g., an RNN or an attention-based block). However, a single kind of modeling structure struggles to balance the learning of short-term and long-term temporal correlations and may bias the network toward one of them, leading to undesirable predictions such as global location shift, temporal inconsistency, and insufficient local detail. To solve these problems, we propose to structurally decouple the modeling of long-term and short-term correlations in an end-to-end framework, the Global-to-Local Transformer (GLoT). First, a global transformer is introduced with a Masked Pose and Shape Estimation strategy for long-term modeling. The strategy encourages the global transformer to learn more inter-frame correlations by randomly masking the features of several frames. Second, a local transformer exploits local details on the human mesh and interacts with the global transformer through cross-attention. Moreover, a Hierarchical Spatial Correlation Regressor is introduced to refine intra-frame estimations using the decoupled global-local representation and implicit kinematic constraints. Our GLoT surpasses previous state-of-the-art methods with the fewest model parameters on popular benchmarks, i.e., 3DPW, MPI-INF-3DHP, and Human3.6M. Codes are available at
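The random frame masking behind the Masked Pose and Shape Estimation strategy can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mask_frames`, the `mask_ratio` value, and the all-zero stand-in for what would normally be a learnable mask token are assumptions made for illustration only, since the abstract does not specify them.

```python
import random


def mask_frames(features, mask_ratio=0.5, mask_token=None, seed=None):
    """Replace a random subset of per-frame feature vectors with a mask token.

    features: list of T per-frame feature vectors (each a list of floats).
    Returns (masked_features, masked_indices). The input is not modified.
    """
    rng = random.Random(seed)
    num_frames = len(features)
    dim = len(features[0]) if num_frames else 0

    # Assumption: a zero vector stands in for a learnable mask token.
    if mask_token is None:
        mask_token = [0.0] * dim

    # Choose which frames to hide (hypothetical ratio, not from the paper).
    num_masked = int(num_frames * mask_ratio)
    masked_indices = sorted(rng.sample(range(num_frames), num_masked))
    masked_set = set(masked_indices)

    # Masked frames get the token; all other frames are copied unchanged.
    masked_features = [
        list(mask_token) if t in masked_set else list(frame)
        for t, frame in enumerate(features)
    ]
    return masked_features, masked_indices


# Usage: an 8-frame sequence of toy 2-D features.
frames = [[float(t), float(t) + 0.5] for t in range(8)]
masked, idx = mask_frames(frames, mask_ratio=0.5, seed=0)
```

In a training setup along these lines, a temporal model would receive `masked` and be asked to predict pose and shape for every frame, so the hidden frames can only be recovered from their neighbours.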



TAPE: Temporal Attention-based Probabilistic human pose and shape Estimation

Reconstructing 3D human pose and shape from monocular videos is a well-s...

MPM: A Unified 2D-3D Human Pose Representation via Masked Pose Modeling

Estimating 3D human poses only from a 2D human pose sequence is thorough...

Action-Agnostic Human Pose Forecasting

Predicting and forecasting human dynamics is a very interesting but chal...

Exemplar-based Video Colorization with Long-term Spatiotemporal Dependency

Exemplar-based video colorization is an essential technique for applicat...

EVOPOSE: A Recursive Transformer For 3D Human Pose Estimation With Kinematic Structure Priors

Transformer is popular in recent 3D human pose estimation, which utilize...

TempFuser: Learning Tactical and Agile Flight Maneuvers in Aerial Dogfights using a Long Short-Term Temporal Fusion Transformer

Aerial dogfights necessitate understanding the tactically changing maneu...

Semantic Role Aware Correlation Transformer for Text to Video Retrieval

With the emergence of social media, voluminous video clips are uploaded ...
