Learning Cross-Image Object Semantic Relation in Transformer for Few-Shot Fine-Grained Image Classification

by Bo Zhang, et al.

Few-shot fine-grained learning aims to classify a query image into one of a set of support categories that differ only in fine-grained details. Although learning the local differences between objects with Deep Neural Networks has been successful, how to exploit query-support cross-image object semantic relations in Transformer-based architectures remains under-explored in the few-shot fine-grained scenario. In this work, we propose a Transformer-based double-helix model, namely HelixFormer, that mines cross-image object semantic relations in a bidirectional and symmetrical manner. HelixFormer consists of two steps: 1) a Relation Mining Process (RMP) across different branches, and 2) a Representation Enhancement Process (REP) within each individual branch. With the designed RMP, each branch extracts fine-grained object-level Cross-image Semantic Relation Maps (CSRMs) using information from the other branch, ensuring better cross-image interaction in semantically related local object regions. Further, with the aid of the CSRMs, the developed REP strengthens the extracted features for the discovered semantically related local regions in each branch, boosting the model's ability to distinguish subtle feature differences between fine-grained objects. Extensive experiments on five public fine-grained benchmarks demonstrate that HelixFormer effectively enhances cross-image object semantic relation matching for recognizing fine-grained objects, achieving much better performance than most state-of-the-art methods under 1-shot and 5-shot scenarios. Our code is available at: https://github.com/JiakangYuan/HelixFormer
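The two-step idea in the abstract can be illustrated with a toy sketch. This is not the HelixFormer implementation: it assumes, purely for illustration, that relation mining can be approximated by scaled dot-product cross-attention between the local features of the two branches, and that enhancement can be approximated by a sigmoid-gated residual. The function names `mine_relations`, `enhance`, and `helix_step` are hypothetical, and the code uses plain Python lists so it runs without dependencies.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mine_relations(own, other):
    """Assumed stand-in for RMP: each local feature in `own` attends over
    all local features in `other` and collects a context vector."""
    dim = len(own[0])
    relation = []
    for feat in own:
        weights = softmax([dot(feat, o) / math.sqrt(dim) for o in other])
        context = [sum(w * o[k] for w, o in zip(weights, other))
                   for k in range(dim)]
        relation.append(context)
    return relation


def enhance(own, relation):
    """Assumed stand-in for REP: scale each feature by a gate in (1, 2)
    derived from the relation map, i.e. a gated residual."""
    def gate(x):
        return 1.0 + 1.0 / (1.0 + math.exp(-x))
    return [[f * gate(r) for f, r in zip(feat, rel)]
            for feat, rel in zip(own, relation)]


def helix_step(query, support):
    """Apply the same mining and enhancement in both directions."""
    q_rel = mine_relations(query, support)   # query uses support information
    s_rel = mine_relations(support, query)   # support uses query information
    return enhance(query, q_rel), enhance(support, s_rel)


# Toy local features: 2 query locations and 3 support locations, dimension 2.
query = [[1.0, 0.0], [0.0, 1.0]]
support = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]
q_out, s_out = helix_step(query, support)
```

Because `helix_step` applies the same two functions in both directions, swapping the query and support inputs simply swaps the outputs, which mirrors the bidirectional and symmetrical design the abstract describes.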



Object-aware Long-short-range Spatial Alignment for Few-Shot Fine-Grained Image Classification

The goal of few-shot fine-grained image classification is to recognize r...

Cross-X Learning for Fine-Grained Visual Categorization

Recognizing objects from subcategories with very subtle differences rema...

Learning Gabor Texture Features for Fine-Grained Recognition

Extracting and using class-discriminative features is critical for fine-...

WiCo: Win-win Cooperation of Bottom-up and Top-down Referring Image Segmentation

The top-down and bottom-up methods are two mainstreams of referring segm...

Learning Semantically Enhanced Feature for Fine-Grained Image Classification

We target at providing a computational cheap yet effective approach for ...

Multi-View Active Fine-Grained Recognition

As fine-grained visual classification (FGVC) being developed for decades...

NaviNeRF: NeRF-based 3D Representation Disentanglement by Latent Semantic Navigation

3D representation disentanglement aims to identify, decompose, and manip...
