SIM-Trans: Structure Information Modeling Transformer for Fine-grained Visual Categorization

08/31/2022
by   Hongbo Sun, et al.
0

Fine-grained visual categorization (FGVC) aims at recognizing objects from similar subordinate categories, which is challenging and practical for human's accurate automatic recognition needs. Most FGVC approaches focus on the attention mechanism research for discriminative regions mining while neglecting their interdependencies and composed holistic object structure, which are essential for model's discriminative information localization and understanding ability. To address the above limitations, we propose the Structure Information Modeling Transformer (SIM-Trans) to incorporate object structure information into transformer for enhancing discriminative representation learning to contain both the appearance information and structure information. Specifically, we encode the image into a sequence of patch tokens and build a strong vision transformer framework with two well-designed modules: (i) the structure information learning (SIL) module is proposed to mine the spatial context relation of significant patches within the object extent with the help of the transformer's self-attention weights, which is further injected into the model for importing structure information; (ii) the multi-level feature boosting (MFB) module is introduced to exploit the complementary of multi-level features and contrastive learning among classes to enhance feature robustness for accurate recognition. The proposed two modules are light-weighted and can be plugged into any transformer network and trained end-to-end easily, which only depends on the attention weights that come with the vision transformer itself. Extensive experiments and analyses demonstrate that the proposed SIM-Trans achieves state-of-the-art performance on fine-grained visual categorization benchmarks. The code is available at https://github.com/PKU-ICST-MIPL/SIM-Trans_ACMMM2022.

READ FULL TEXT

page 3

page 6

page 7

research
07/06/2021

Feature Fusion Vision Transformer for Fine-Grained Visual Categorization

The core for tackling the fine-grained visual categorization (FGVC) is t...
research
05/04/2022

Dual Cross-Attention Learning for Fine-Grained Visual Categorization and Object Re-Identification

Recently, self-attention mechanisms have shown impressive performance in...
research
03/31/2020

Look-into-Object: Self-supervised Structure Modeling for Object Recognition

Most object recognition approaches predominantly focus on learning discr...
research
08/05/2021

TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding

Recently proposed fine-grained 3D visual grounding is an essential and c...
research
01/06/2022

TransVPR: Transformer-based place recognition with multi-level attention aggregation

Visual place recognition is a challenging task for applications such as ...
research
08/01/2022

Improving Fine-Grained Visual Recognition in Low Data Regimes via Self-Boosting Attention Mechanism

The challenge of fine-grained visual recognition often lies in discoveri...
research
09/25/2019

Attention Convolutional Binary Neural Tree for Fine-Grained Visual Categorization

Fine-grained visual categorization (FGVC) is an important but challengin...

Please sign up or login with your details

Forgot password? Click here to reset