Locality Guidance for Improving Vision Transformers on Tiny Datasets

by Kehan Li, et al.

While the Vision Transformer (VT) architecture is becoming increasingly popular in computer vision, pure VT models perform poorly on tiny datasets. To address this issue, this paper proposes locality guidance for improving the performance of VTs on tiny datasets. We first show that local information, which is of great importance for understanding images, is hard to learn from limited data because of the high flexibility and intrinsic globality of the self-attention mechanism in VTs. To facilitate the learning of local information, we realize locality guidance for VTs by having them imitate the features of an already trained convolutional neural network (CNN), inspired by the built-in local-to-global hierarchy of CNNs. Under our dual-task learning paradigm, the locality guidance provided by a lightweight CNN trained on low-resolution images is adequate to accelerate convergence and substantially improve the performance of VTs. Our locality guidance approach is therefore simple and efficient, and can serve as a basic performance enhancement method for VTs on tiny datasets. Extensive experiments demonstrate that our method significantly improves VTs trained from scratch on tiny datasets and is compatible with different kinds of VTs and datasets. For example, our method boosts the performance of various VTs on tiny datasets (e.g., 13.07 for PVT) and improves even the stronger baseline PVTv2 by 1.86, showing the potential of VTs on tiny datasets. The code is available at https://github.com/lkhl/tiny-transformers.
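
The dual-task idea described above, a classification objective combined with imitation of a trained CNN's features, can be illustrated with a minimal sketch. The abstract does not state the exact form of the imitation loss, so the mean-squared-error term, the weight `beta`, and all function names below are assumptions made for illustration only, not the authors' implementation. The sketch also assumes the VT and CNN features have already been aligned to the same size.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy for one sample, computed from raw logits."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]


def feature_mse(student_feat, teacher_feat):
    """Mean squared error between two flat feature vectors of equal length.

    Assumption: MSE stands in for whatever imitation loss the paper uses.
    """
    if len(student_feat) != len(teacher_feat):
        raise ValueError("features must be aligned to the same size first")
    squared = [(s - t) ** 2 for s, t in zip(student_feat, teacher_feat)]
    return sum(squared) / len(squared)


def dual_task_loss(logits, label, student_feat, teacher_feat, beta=1.0):
    """Classification loss plus a weighted feature-imitation term.

    `beta` is a hypothetical weighting hyperparameter.
    """
    return cross_entropy(logits, label) + beta * feature_mse(
        student_feat, teacher_feat
    )


if __name__ == "__main__":
    # Toy values: VT logits for 3 classes, plus VT (student) and
    # CNN (teacher) features already flattened to the same length.
    loss = dual_task_loss(
        logits=[2.0, 0.5, -1.0],
        label=0,
        student_feat=[0.1, 0.4, 0.3],
        teacher_feat=[0.0, 0.5, 0.3],
        beta=0.5,
    )
    print(f"dual-task loss: {loss:.4f}")
```

In this sketch the teacher features are treated as fixed targets. That matches the abstract's description of an already trained CNN, although whether the CNN is frozen during VT training is itself an assumption here.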




