Neighbor-view Enhanced Model for Vision and Language Navigation

07/15/2021
by   Dong An, et al.

Vision and Language Navigation (VLN) requires an agent to navigate to a target location by following natural language instructions. Most existing works represent a navigation candidate by the feature of the single view in which the candidate lies. However, an instruction may mention landmarks outside that single view as references, which can cause textual-visual matching failures in existing methods. In this work, we propose a multi-module Neighbor-View Enhanced Model (NvEM) that adaptively incorporates visual contexts from neighbor views for better textual-visual matching. Specifically, NvEM uses a subject module and a reference module to collect contexts from neighbor views: the subject module fuses neighbor views at a global level, and the reference module fuses neighbor objects at a local level. Subjects and references are determined adaptively via attention mechanisms. The model also includes an action module to exploit the strong orientation guidance (e.g., "turn left") in instructions. Each module predicts a navigation action separately, and their weighted sum is used to predict the final action. Extensive experiments on the R2R and R4R benchmarks demonstrate the effectiveness of the proposed method against several state-of-the-art navigators, and NvEM even outperforms some pre-training-based models. Our code is available at https://github.com/MarSaKi/NvEM.
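To make the module-fusion idea concrete, the sketch below shows one way three per-module candidate scores (action, subject, reference) could be pooled with attention over neighbor views and neighbor objects and then combined by learned weights. This is a minimal illustration only: the class name, tensor shapes, and layer choices are assumptions, not the authors' released implementation (see the repository above for that).

```python
# Illustrative sketch of the NvEM fusion idea; all names and shapes are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class NvEMFusionSketch(nn.Module):
    def __init__(self, txt_dim=512, vis_dim=512):
        super().__init__()
        # Subject module: attends over each candidate's neighbor views (global level).
        self.subj_att = nn.Linear(txt_dim, vis_dim)
        self.subj_score = nn.Linear(vis_dim + txt_dim, 1)
        # Reference module: attends over neighbor objects (local level).
        self.ref_att = nn.Linear(txt_dim, vis_dim)
        self.ref_score = nn.Linear(vis_dim + txt_dim, 1)
        # Action module: scores candidates from orientation features (e.g. "turn left").
        self.act_score = nn.Linear(vis_dim + txt_dim, 1)
        # Learned weights for mixing the three per-module logits.
        self.mix = nn.Linear(txt_dim, 3)

    def _attend(self, query, keys):
        # query: (B, D), keys: (B, K, N, D) -> attention-pooled context (B, K, D)
        logits = torch.einsum('bd,bknd->bkn', query, keys)
        att = F.softmax(logits, dim=-1)
        return torch.einsum('bkn,bknd->bkd', att, keys)

    def forward(self, txt, cand_act, neigh_views, neigh_objs):
        # txt:         (B, D)       instruction encoding
        # cand_act:    (B, K, D)    orientation/action feature per candidate
        # neigh_views: (B, K, N, D) view features of each candidate's neighbors
        # neigh_objs:  (B, K, M, D) object features around each candidate
        B, K, _ = cand_act.shape
        txt_k = txt.unsqueeze(1).expand(B, K, -1)

        subj = self._attend(self.subj_att(txt), neigh_views)   # (B, K, D)
        ref = self._attend(self.ref_att(txt), neigh_objs)      # (B, K, D)

        s_subj = self.subj_score(torch.cat([subj, txt_k], -1)).squeeze(-1)
        s_ref = self.ref_score(torch.cat([ref, txt_k], -1)).squeeze(-1)
        s_act = self.act_score(torch.cat([cand_act, txt_k], -1)).squeeze(-1)

        # Weighted sum of the three modules' candidate logits gives the final action scores.
        w = F.softmax(self.mix(txt), dim=-1)                    # (B, 3)
        return w[:, 0:1] * s_act + w[:, 1:2] * s_subj + w[:, 2:3] * s_ref  # (B, K)


if __name__ == "__main__":
    model = NvEMFusionSketch()
    B, K, N, M, D = 2, 4, 5, 6, 512
    scores = model(torch.randn(B, D), torch.randn(B, K, D),
                   torch.randn(B, K, N, D), torch.randn(B, K, M, D))
    print(scores.shape)  # torch.Size([2, 4])
```

The candidate with the highest combined score would then be chosen as the next navigation action.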

Related research

- Object-and-Action Aware Model for Visual Language Navigation (07/29/2020)
- HOP: History-and-Order Aware Pre-training for Vision-and-Language Navigation (03/22/2022)
- GeoVLN: Learning Geometry-Enhanced Visual Representation with Slot Attention for Vision-and-Language Navigation (05/26/2023)
- KERM: Knowledge Enhanced Reasoning for Vision-and-Language Navigation (03/28/2023)
- Improving Vision-and-Language Navigation by Generating Future-View Image Semantics (04/11/2023)
- A Dual Semantic-Aware Recurrent Global-Adaptive Network For Vision-and-Language Navigation (05/05/2023)
- Visual Pre-training for Navigation: What Can We Learn from Noise? (06/30/2022)
