Vision-Language Navigation with Self-Supervised Auxiliary Reasoning Tasks

by Fengda Zhu, et al.

Vision-Language Navigation (VLN) is a task in which agents learn to navigate by following natural language instructions. The key to this task is to perceive both the visual scene and the natural language sequentially. Conventional approaches exploit vision and language features through cross-modal grounding. However, the VLN task remains challenging, since previous works have neglected the rich semantic information contained in the environment (such as implicit navigation graphs or sub-trajectory semantics). In this paper, we introduce Auxiliary Reasoning Navigation (AuxRN), a framework with four self-supervised auxiliary reasoning tasks that take advantage of the additional training signals derived from this semantic information. The auxiliary tasks have four reasoning objectives: explaining the previous actions, estimating the navigation progress, predicting the next orientation, and evaluating the trajectory consistency. These additional training signals help the agent acquire semantic representations with which to reason about its activity and build a thorough perception of the environment. Our experiments indicate that the auxiliary reasoning tasks improve both the performance of the main task and the model's generalizability by a large margin. Empirically, we demonstrate that an agent trained with self-supervised auxiliary reasoning tasks substantially outperforms the previous state-of-the-art method and is the best existing approach on the standard benchmark.
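As a rough illustration of how four auxiliary objectives might contribute extra training signal alongside a main navigation loss, the sketch below adds a weighted sum of auxiliary losses to the main loss. It is a minimal sketch under stated assumptions, not the AuxRN implementation: the task names, the weighting scheme, and the `progress_target` label (fraction of the trajectory completed) are all hypothetical and are not taken from the abstract.

```python
from typing import Dict

# Hypothetical identifiers for the four auxiliary objectives named in the abstract.
AUX_TASKS = (
    "explain_actions",
    "progress",
    "next_orientation",
    "trajectory_consistency",
)


def progress_target(step: int, total_steps: int) -> float:
    """Illustrative self-supervised label: fraction of the trajectory completed.

    Assumption: progress is measured in steps; the paper may define it differently.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return min(max(step / total_steps, 0.0), 1.0)


def combined_loss(
    main_loss: float,
    aux_losses: Dict[str, float],
    weights: Dict[str, float],
) -> float:
    """Return the main loss plus a weighted sum of the auxiliary losses.

    Tasks missing from either dictionary contribute nothing.
    """
    total = main_loss
    for name in AUX_TASKS:
        total += weights.get(name, 0.0) * aux_losses.get(name, 0.0)
    return total
```

For example, `combined_loss(1.0, {"progress": 2.0}, {"progress": 0.5})` adds half of the progress loss to the main loss. Because the labels for such tasks come from the trajectory itself, no extra annotation would be needed, which is the sense in which the tasks are self-supervised.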




Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation

Vision-language navigation (VLN) is the task of navigating an embodied a...

Self-supervised 3D Semantic Representation Learning for Vision-and-Language Navigation

In the Vision-and-Language Navigation task, the embodied agent follows l...

Auxiliary Tasks Speed Up Learning PointGoal Navigation

PointGoal Navigation is an embodied task that requires agents to navigat...

A Self-Supervised Auxiliary Loss for Deep RL in Partially Observable Settings

In this work we explore an auxiliary loss useful for reinforcement learn...

A General Purpose Supervisory Signal for Embodied Agents

Training effective embodied AI agents often involves manual reward engin...

A Priority Map for Vision-and-Language Navigation with Trajectory Plans and Feature-Location Cues

In a busy city street, a pedestrian surrounded by distractions can pick ...

Visual Probing: Cognitive Framework for Explaining Self-Supervised Image Representations

Recently introduced self-supervised methods for image representation lea...
