ADAPT: Vision-Language Navigation with Modality-Aligned Action Prompts

by   Bingqian Lin, et al.

Vision-Language Navigation (VLN) is a challenging task that requires an embodied agent to perform action-level modality alignment, i.e., make instruction-asked actions sequentially in complex visual environments. Most existing VLN agents learn the instruction-path data directly and cannot sufficiently explore action-level alignment knowledge inside the multi-modal inputs. In this paper, we propose modAlity-aligneD Action PrompTs (ADAPT), which provides the VLN agent with action prompts to enable the explicit learning of action-level modality alignment to pursue successful navigation. Specifically, an action prompt is defined as a modality-aligned pair of an image sub-prompt and a text sub-prompt, where the former is a single-view observation and the latter is a phrase like ”walk past the chair”. When starting navigation, the instruction-related action prompt set is retrieved from a pre-built action prompt base and passed through a prompt encoder to obtain the prompt feature. Then the prompt feature is concatenated with the original instruction feature and fed to a multi-layer transformer for action prediction. To collect high-quality action prompts into the prompt base, we use the Contrastive Language-Image Pretraining (CLIP) model which has powerful cross-modality alignment ability. A modality alignment loss and a sequential consistency loss are further introduced to enhance the alignment of the action prompt and enforce the agent to focus on the related prompt sequentially. Experimental results on both R2R and RxR show the superiority of ADAPT over state-of-the-art methods.


page 3

page 4

page 8

page 13

page 14

page 15

page 16


Actional Atomic-Concept Learning for Demystifying Vision-Language Navigation

Vision-Language Navigation (VLN) is a challenging task which requires an...

Accessible Instruction-Following Agent

Humans can collaborate and complete tasks based on visual signals and in...

Prompt-based Context- and Domain-aware Pretraining for Vision and Language Navigation

With strong representation capabilities, pretrained vision-language mode...

A^2Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models

We study the task of zero-shot vision-and-language navigation (ZS-VLN), ...

Kefa: A Knowledge Enhanced and Fine-grained Aligned Speaker for Navigation Instruction Generation

We introduce a novel speaker model Kefa for navigation instruction gener...

Transferable Representation Learning in Vision-and-Language Navigation

Vision-and-Language Navigation (VLN) tasks such as Room-to-Room (R2R) re...

Improving Vision-and-Language Navigation with Image-Text Pairs from the Web

Following a navigation instruction such as 'Walk down the stairs and sto...

Please sign up or login with your details

Forgot password? Click here to reset