Pretrained Language Models as Visual Planners for Human Assistance

by Dhruvesh Patel, et al.

To make progress towards multi-modal AI assistants which can guide users to achieve complex multi-step goals, we propose the task of Visual Planning for Assistance (VPA). Given a goal briefly described in natural language, e.g., "make a shelf", and a video of the user's progress so far, the aim of VPA is to obtain a plan, i.e., a sequence of actions such as "sand shelf", "paint shelf", etc., to achieve the goal. This requires assessing the user's progress from the untrimmed video and relating it to the requirements of the underlying goal, i.e., the relevance of actions and the ordering dependencies amongst them. Consequently, this requires handling long video history and arbitrarily complex action dependencies. To address these challenges, we decompose VPA into video action segmentation and forecasting. We formulate the forecasting step as a multi-modal sequence modeling problem and present Visual Language Model based Planner (VLaMP), which leverages pre-trained LMs as the sequence model. We demonstrate that VLaMP performs significantly better than baselines w.r.t. all metrics that evaluate the generated plan. Moreover, through extensive ablations, we also isolate the value of language pre-training, visual observations, and goal information on performance. We will release our data, model, and code to enable future research on visual planning for assistance.
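
The two-stage decomposition described above (action segmentation, then forecasting) can be illustrated with a minimal Python sketch. This is not the authors' VLaMP implementation: every name here (`Segment`, `segment_actions`, `plan`, `toy_next_action`, `TOY_RECIPES`) is hypothetical, the video is assumed to have already been reduced to per-frame action labels, and a hand-written lookup table stands in for the pre-trained LM that the abstract uses as the sequence model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class Segment:
    action: str
    start: int  # first frame index of the segment
    end: int    # one past the last frame index


def segment_actions(frame_labels: Sequence[str]) -> List[Segment]:
    """Collapse per-frame labels into contiguous action segments.

    Assumption: some upstream classifier has already labeled each frame.
    """
    segments: List[Segment] = []
    for i, label in enumerate(frame_labels):
        if segments and segments[-1].action == label:
            segments[-1].end = i + 1
        else:
            segments.append(Segment(label, i, i + 1))
    return segments


def plan(
    goal: str,
    frame_labels: Sequence[str],
    next_action: Callable[[str, List[str]], Optional[str]],
    horizon: int,
) -> List[str]:
    """Forecast up to `horizon` future actions, one step at a time."""
    # Stage 1: recover the action history from the (labeled) video.
    history = [s.action for s in segment_actions(frame_labels)
               if s.action != "background"]
    # Stage 2: autoregressive forecasting conditioned on goal and history.
    steps: List[str] = []
    for _ in range(horizon):
        step = next_action(goal, history + steps)
        if step is None:  # the forecaster considers the goal complete
            break
        steps.append(step)
    return steps


# Hypothetical stand-in for the sequence model: a fixed step order per goal.
TOY_RECIPES = {
    "make a shelf": ["cut wood", "assemble shelf", "sand shelf", "paint shelf"],
}


def toy_next_action(goal: str, done: List[str]) -> Optional[str]:
    """Return the first step of the toy recipe that has not been done yet."""
    for step in TOY_RECIPES.get(goal, []):
        if step not in done:
            return step
    return None


frames = ["background", "cut wood", "cut wood", "background", "assemble shelf"]
print(plan("make a shelf", frames, toy_next_action, horizon=3))
```

Passing the forecaster in as a callable keeps the segmentation and forecasting stages independent, so the toy lookup could in principle be swapped for a learned sequence model without changing `plan`.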




AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn

Recent research on Large Language Models (LLMs) has led to remarkable ad...

Scene Induced Multi-Modal Trajectory Forecasting via Planning

We address multi-modal trajectory forecasting of agents in unknown scene...

Learning the Effects of Physical Actions in a Multi-modal Environment

Large Language Models (LLMs) handle physical commonsense information ina...

Distilling Script Knowledge from Large Language Models for Constrained Language Planning

In everyday life, humans often plan their actions by following step-by-s...

See, Plan, Predict: Language-guided Cognitive Planning with Video Prediction

Cognitive planning is the structural decomposition of complex tasks into...

Visual-Thermal Camera Dataset Release and Multi-Modal Alignment without Calibration Information

This report accompanies a dataset release on visual and thermal camera d...

A Preliminary Case Study of Planning With Complex Transitions: Plotting

Plotting is a tile-matching puzzle video game published by Taito in 1989...
