DeepAI AI Chat
Log In Sign Up

Structured World Models from Human Videos

by   Russell Mendonca, et al.

We tackle the problem of learning complex, general behaviors directly in the real world. We propose an approach for robots to efficiently learn manipulation skills using only a handful of real-world interaction trajectories from many different settings. Inspired by the success of learning from large-scale datasets in the fields of computer vision and natural language, our belief is that in order to efficiently learn, a robot must be able to leverage internet-scale, human video data. Humans interact with the world in many interesting ways, which can allow a robot to not only build an understanding of useful actions and affordances but also how these actions affect the world for manipulation. Our approach builds a structured, human-centric action space grounded in visual affordances learned from human videos. Further, we train a world model on human videos and fine-tune on a small amount of robot interaction data without any task supervision. We show that this approach of affordance-space world models enables different robots to learn various manipulation skills in complex settings, in under 30 minutes of interaction. Videos can be found at


page 1

page 2

page 3

page 4

page 5

page 7

page 8


VideoDex: Learning Dexterity from Internet Videos

To build general robotic agents that can operate in many environments, i...

Learning Predictive Models From Observation and Interaction

Learning predictive models from interaction with the world allows an age...

Affordances from Human Videos as a Versatile Representation for Robotics

Building a robot that can understand and learn to interact by watching h...

ALAN: Autonomously Exploring Robotic Agents in the Real World

Robotic agents that operate autonomously in the real world need to conti...

Toward Grounded Social Reasoning

Consider a robot tasked with tidying a desk with a meticulously construc...

Robotic Telekinesis: Learning a Robotic Hand Imitator by Watching Humans on Youtube

We build a system that enables any human to control a robot hand and arm...