A Cross-Scale Hierarchical Transformer with Correspondence-Augmented Attention for inferring Bird's-Eye-View Semantic Segmentation

by   Naiyu Fang, et al.

As bird's-eye-view (BEV) semantic segmentation is simple-to-visualize and easy-to-handle, it has been applied in autonomous driving to provide the surrounding information to downstream tasks. Inferring BEV semantic segmentation conditioned on multi-camera-view images is a popular scheme in the community as cheap devices and real-time processing. The recent work implemented this task by learning the content and position relationship via the vision Transformer (ViT). However, the quadratic complexity of ViT confines the relationship learning only in the latent layer, leaving the scale gap to impede the representation of fine-grained objects. And their plain fusion method of multi-view features does not conform to the information absorption intention in representing BEV features. To tackle these issues, we propose a novel cross-scale hierarchical Transformer with correspondence-augmented attention for semantic segmentation inferring. Specifically, we devise a hierarchical framework to refine the BEV feature representation, where the last size is only half of the final segmentation. To save the computation increase caused by this hierarchical framework, we exploit the cross-scale Transformer to learn feature relationships in a reversed-aligning way, and leverage the residual connection of BEV features to facilitate information transmission between scales. We propose correspondence-augmented attention to distinguish conducive and inconducive correspondences. It is implemented in a simple yet effective way, amplifying attention scores before the Softmax operation, so that the position-view-related and the position-view-disrelated attention scores are highlighted and suppressed. Extensive experiments demonstrate that our method has state-of-the-art performance in inferring BEV semantic segmentation conditioned on multi-camera-view images.


page 1

page 4

page 6

page 9

page 11


BEVSegFormer: Bird's Eye View Semantic Segmentation From Arbitrary Camera Rigs

Semantic segmentation in bird's eye view (BEV) is an important task for ...

SA-BEV: Generating Semantic-Aware Bird's-Eye-View Feature for Multi-view 3D Object Detection

Recently, the pure camera-based Bird's-Eye-View (BEV) perception provide...

CoBEVT: Cooperative Bird's Eye View Semantic Segmentation with Sparse Transformers

Bird's eye view (BEV) semantic segmentation plays a crucial role in spat...

LaRa: Latents and Rays for Multi-Camera Bird's-Eye-View Semantic Segmentation

Recent works in autonomous driving have widely adopted the bird's-eye-vi...

Transformer Scale Gate for Semantic Segmentation

Effectively encoding multi-scale contextual information is crucial for a...

Parametric Depth Based Feature Representation Learning for Object Detection and Segmentation in Bird's Eye View

Recent vision-only perception models for autonomous driving achieved pro...

Image segmentation of cross-country scenes captured in IR spectrum

Computer vision has become a major source of information for autonomous ...

Please sign up or login with your details

Forgot password? Click here to reset