Putting People in their Place: Monocular Regression of 3D People in Depth

by Yu Sun, et al.

Given an image with multiple people, our goal is to directly regress the pose and shape of all the people as well as their relative depth. Inferring the depth of a person in an image, however, is fundamentally ambiguous without knowing their height. This is particularly problematic when the scene contains people of very different sizes, e.g. from infants to adults. Solving this requires three things. First, we develop a novel method to infer the poses and depth of multiple people in a single image. While previous work on multi-person estimation reasons in the image plane, our method, called BEV, adds an imaginary Bird's-Eye-View representation to explicitly reason about depth. BEV reasons simultaneously about body centers in the image and in depth and, by combining these, estimates 3D body position. Unlike prior work, BEV is a single-shot method that is end-to-end differentiable. Second, height varies with age, making it impossible to resolve depth without also estimating the age of people in the image. To do so, we exploit a 3D body model space that lets BEV infer shapes from infants to adults. Third, to train BEV, we need a new dataset. Specifically, we create a "Relative Human" (RH) dataset that includes age labels and relative depth relationships between the people in the images. Extensive experiments on RH and AGORA demonstrate the effectiveness of the model and training scheme. BEV outperforms existing methods on depth reasoning, child shape estimation, and robustness to occlusion. The code and dataset will be released for research purposes.
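The height/depth ambiguity described in the abstract can be illustrated with a small worked example. This is only a sketch under an idealized pinhole camera model, not part of BEV itself; the function names, the focal length, and the body heights and distances are all hypothetical values chosen for illustration.

```python
def projected_height_px(height_m: float, depth_m: float, focal_px: float) -> float:
    """Image-space height (pixels) of an upright person under a pinhole camera."""
    return focal_px * height_m / depth_m


def depth_from_height(pixel_height: float, assumed_height_m: float, focal_px: float) -> float:
    """Invert the pinhole relation: depth implied by an assumed body height."""
    return focal_px * assumed_height_m / pixel_height


FOCAL_PX = 1000.0  # assumed focal length in pixels (illustrative only)

# An adult far away and a child close by can cover the same number of pixels.
adult_px = projected_height_px(height_m=1.75, depth_m=3.5, focal_px=FOCAL_PX)
child_px = projected_height_px(height_m=1.0, depth_m=2.0, focal_px=FOCAL_PX)

# The same observation yields different depths depending on the assumed height.
depth_if_adult = depth_from_height(child_px, assumed_height_m=1.75, focal_px=FOCAL_PX)
depth_if_child = depth_from_height(child_px, assumed_height_m=1.0, focal_px=FOCAL_PX)

print(f"adult: {adult_px:.1f} px, child: {child_px:.1f} px")
print(f"depth if treated as adult: {depth_if_adult:.2f} m")
print(f"depth if treated as child: {depth_if_child:.2f} m")
```

Under these assumed numbers both people have the same pixel height, so a method that treats everyone as an adult would place the child at the adult's distance. This is why the abstract ties depth estimation to estimating age and body shape.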




