RingMo-lite: A Remote Sensing Multi-task Lightweight Network with CNN-Transformer Hybrid Framework

by Yuelei Wang, et al.

In recent years, remote sensing (RS) vision foundation models such as RingMo have emerged and achieved excellent performance in various downstream tasks. However, their high demand for computing resources limits their application on edge devices, so a more lightweight foundation model is needed to support on-orbit RS image interpretation. Existing methods struggle to be lightweight while retaining generalization in RS image interpretation. This is because RS images contain complex high-frequency and low-frequency spectral components, which makes a traditional CNN-only or Vision Transformer-only design unsuitable for the task. Therefore, this paper proposes RingMo-lite, an RS multi-task lightweight network with a CNN-Transformer hybrid framework, which exploits the frequency-domain properties of RS images to optimize the interpretation process. Through a dual-branch structure, it combines a Transformer module, acting as a low-pass filter that extracts global features of RS images, with a CNN module, acting as a stacked high-pass filter that extracts fine-grained details. Furthermore, in the pretraining stage, the designed frequency-domain masked image modeling (FD-MIM) combines the high-frequency and low-frequency characteristics of each image patch, effectively capturing the latent feature representation in RS data. As shown in Fig. 1, compared with RingMo, the proposed RingMo-lite reduces the number of parameters by over 60% in various RS image interpretation tasks, while the average accuracy drops by less than 2% in most scenes, and it achieves SOTA performance compared to models of similar size. In addition, our work will be integrated into the MindSpore computing platform in the near future.
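To make the low-pass/high-pass intuition concrete, the following is a minimal sketch of splitting an image into a smooth low-frequency part and a detail-carrying high-frequency residual. It is not the RingMo-lite architecture or FD-MIM: a simple spatial box blur stands in for the low-pass branch as an assumption, and the names `box_blur` and `split_frequencies` are hypothetical.

```python
def box_blur(image, radius=1):
    """Low-pass filter: replace each pixel by the mean of its neighbourhood.

    Neighbours that fall outside the image are ignored.
    `image` is a list of rows of floats (a single-channel image).
    """
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        total += image[yy][xx]
                        count += 1
            out[y][x] = total / count
    return out


def split_frequencies(image, radius=1):
    """Return (low, high) such that low + high reproduces the input image."""
    low = box_blur(image, radius)
    # The high-frequency part is the residual left after smoothing.
    high = [[p - l for p, l in zip(row, low_row)]
            for row, low_row in zip(image, low)]
    return low, high


if __name__ == "__main__":
    # A single bright pixel: mostly high-frequency content.
    spike = [[0.0, 0.0, 0.0],
             [0.0, 9.0, 0.0],
             [0.0, 0.0, 0.0]]
    low, high = split_frequencies(spike)
    print("low :", low[1])
    print("high:", high[1])
```

In this toy decomposition, a flat region produces a zero high-frequency residual, while an isolated bright pixel lands almost entirely in the high-frequency part. That split is the kind of global-structure versus fine-detail separation the abstract attributes to the Transformer and CNN branches respectively.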



Lightweight Structure-aware Transformer Network for VHR Remote Sensing Image Change Detection

Popular Transformer networks have been successfully applied to remote se...

LiT-4-RSVQA: Lightweight Transformer-based Visual Question Answering in Remote Sensing

Visual question answering (VQA) methods in remote sensing (RS) aim to an...

Adapting Segment Anything Model for Change Detection in HR Remote Sensing Images

Vision Foundation Models (VFMs) such as the Segment Anything Model (SAM)...

MMFormer: Multimodal Transformer Using Multiscale Self-Attention for Remote Sensing Image Classification

To benefit the complementary information between heterogeneous data, we ...

Improving Vision Transformers by Revisiting High-frequency Components

The transformer models have shown promising effectiveness in dealing wit...

Contextual Learning in Fourier Complex Field for VHR Remote Sensing Images

Very high-resolution (VHR) remote sensing (RS) image classification is t...

APPLeNet: Visual Attention Parameterized Prompt Learning for Few-Shot Remote Sensing Image Generalization using CLIP

In recent years, the success of large-scale vision-language models (VLMs...
