Grounding Commands for Autonomous Vehicles via Layer Fusion with Region-specific Dynamic Layer Attention

03/14/2022
by Hou Pong (Ken) Chan, et al.

Grounding a command to the visual environment is essential for interaction between autonomous vehicles and humans. In this work, we study the problem of language grounding for autonomous vehicles, which aims to localize a region in a visual scene according to a natural language command from a passenger. Prior work employs only the top-layer representations of a vision-and-language pre-trained model to predict the region referred to by the command. However, such a method omits the useful features encoded in other layers and thus results in an inadequate understanding of the input scene and command. To tackle this limitation, we present the first layer fusion approach for this task. Since different visual regions may require distinct types of features to disambiguate them from each other, we further propose region-specific dynamic (RSD) layer attention, which adaptively fuses the multimodal information across layers for each region. Extensive experiments on the Talk2Car benchmark demonstrate that our approach helps predict more accurate regions and outperforms state-of-the-art methods.
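
The abstract does not spell out how the fusion is computed, so the following is only a minimal sketch of the general idea of per-region attention over layers, not the authors' formulation. It assumes each layer of the pre-trained model yields one feature vector per candidate region, and that a single scoring vector (the hypothetical `score_vec`) rates each layer's feature for a given region. The function name `rsd_layer_fusion` and the toy data are also illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def rsd_layer_fusion(layer_feats, score_vec):
    """Fuse features across layers with a separate weight set per region.

    layer_feats[l][r] is the feature vector of region r at layer l.
    score_vec is a hypothetical scoring vector (an assumption of this sketch).
    Returns (fused, weights): fused[r] is the fused vector of region r and
    weights[r] is that region's attention distribution over layers.
    """
    num_layers = len(layer_feats)
    num_regions = len(layer_feats[0])
    fused, weights = [], []
    for r in range(num_regions):
        per_layer = [layer_feats[l][r] for l in range(num_layers)]
        # The weights depend on this region's own features, so two regions
        # can emphasize different layers.
        alpha = softmax([dot(score_vec, h) for h in per_layer])
        dim = len(per_layer[0])
        fused.append([
            sum(alpha[l] * per_layer[l][d] for l in range(num_layers))
            for d in range(dim)
        ])
        weights.append(alpha)
    return fused, weights


# Toy input: 3 layers, 2 regions, 2-dimensional features (made-up values).
layer_feats = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 1.0]],
]
score_vec = [1.0, 0.0]
fused, weights = rsd_layer_fusion(layer_feats, score_vec)
```

In this toy input the first region weights the third layer most heavily, while the second region, whose features are identical at every layer, ends up with uniform weights. A top-layer-only baseline, by contrast, would correspond to always placing all weight on the last layer.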

research
09/18/2020

Commands 4 Autonomous Vehicles (C4AV) Workshop Summary

The task of visual grounding requires locating the most relevant region ...
research
12/24/2021

Grounding Linguistic Commands to Navigable Regions

Humans have a natural ability to effortlessly comprehend linguistic comm...
research
03/29/2022

Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding

Visual grounding focuses on establishing fine-grained alignment between ...
research
09/11/2020

AttnGrounder: Talking to Cars with Attention

We propose Attention Grounder (AttnGrounder), a single-stage end-to-end ...
research
05/25/2023

Language-Guided 3D Object Detection in Point Cloud for Autonomous Driving

This paper addresses the problem of 3D referring expression comprehensio...
research
09/24/2022

Ground then Navigate: Language-guided Navigation in Dynamic Scenes

We investigate the Vision-and-Language Navigation (VLN) problem in the c...
research
12/13/2019

Grounding-Tracking-Integration

In this paper, we study tracking by language that localizes the target b...
