Exploring a Fine-Grained Multiscale Method for Cross-Modal Remote Sensing Image Retrieval

by Zhiqiang Yuan, et al.

Remote sensing (RS) cross-modal text-image retrieval has attracted extensive attention for its flexible input and efficient querying. However, traditional methods ignore the multi-scale and redundant targets characteristic of RS images, which degrades retrieval accuracy. To address multi-scale scarcity and target redundancy in the RS multimodal retrieval task, we propose a novel asymmetric multimodal feature matching network (AMFMN). Our model adapts to multi-scale feature inputs, favors multi-source retrieval methods, and can dynamically filter redundant features. AMFMN employs a multi-scale visual self-attention (MVSA) module to extract the salient features of an RS image and uses these visual features to guide the text representation. Furthermore, to alleviate the positive-sample ambiguity caused by the strong intraclass similarity of RS images, we propose a triplet loss function with a dynamic variable margin based on the prior similarity of sample pairs. Finally, unlike traditional RS image-text datasets with coarse text and higher intraclass similarity, we construct a fine-grained and more challenging Remote sensing Image-Text Match dataset (RSITMD), which supports RS image retrieval through keywords and sentences, separately and jointly. Experiments on four RS text-image datasets demonstrate that the proposed model achieves state-of-the-art performance in the cross-modal RS text-image retrieval task.
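
To make the idea of a triplet loss with a dynamic variable margin more concrete, here is a minimal pure-Python sketch. The abstract does not give the margin formula, so the rule used here (the margin shrinks linearly as the prior similarity of a pair grows) is an assumption for illustration only. The names `dynamic_margin_triplet_loss`, `sim`, `prior`, and `base_margin` are hypothetical, not taken from the paper.

```python
def dynamic_margin_triplet_loss(sim, prior, base_margin=0.2):
    """Bidirectional hinge triplet loss with a per-pair margin (illustrative sketch).

    sim[i][j]   : model similarity between image i and text j;
                  matched image-text pairs sit on the diagonal.
    prior[i][j] : prior similarity in [0, 1] between samples i and j,
                  assumed to be symmetric and computed beforehand.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = sim[i][i]  # similarity of the matched pair
        for j in range(n):
            if i == j:
                continue
            # ASSUMED rule: negatives that are a priori close to the anchor
            # get a smaller margin, so they are pushed away less strongly.
            margin = base_margin * (1.0 - prior[i][j])
            total += max(0.0, margin - pos + sim[i][j])  # image i vs. negative text j
            total += max(0.0, margin - pos + sim[j][i])  # text i vs. negative image j
    return total


# Two image-text pairs whose samples are a priori fairly similar (0.5).
sim = [[0.90, 0.85],
       [0.20, 0.80]]
prior = [[1.0, 0.5],
         [0.5, 1.0]]
no_prior = [[1.0, 0.0],
            [0.0, 1.0]]

dynamic_loss = dynamic_margin_triplet_loss(sim, prior)
fixed_loss = dynamic_margin_triplet_loss(sim, no_prior)  # reduces to a fixed margin
```

With a prior of zero for every negative pair, the sketch falls back to the usual fixed-margin triplet loss, so the two calls show how the prior softens the penalty on negatives that resemble the anchor.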



Remote Sensing Cross-Modal Text-Image Retrieval Based on Global and Local Information

Cross-modal remote sensing text-image retrieval (RSCTIR) has recently be...

MMFormer: Multimodal Transformer Using Multiscale Self-Attention for Remote Sensing Image Classification

To benefit the complementary information between heterogeneous data, we ...

Exploiting Deep Features for Remote Sensing Image Retrieval: A Systematic Investigation

Remote sensing (RS) image retrieval based on visual content is of great ...

Progressive Scale-aware Network for Remote sensing Image Change Captioning

Remote sensing (RS) images contain numerous objects of different scales,...

Learning to Evaluate Performance of Multi-modal Semantic Localization

Semantic localization (SeLo) refers to the task of obtaining the most re...

Relieving Triplet Ambiguity: Consensus Network for Language-Guided Image Retrieval

Language-guided image retrieval enables users to search for images and i...

Scale-Semantic Joint Decoupling Network for Image-text Retrieval in Remote Sensing

Image-text retrieval in remote sensing aims to provide flexible informat...
