Auto-scaling Vision Transformers without Training

by Wuyang Chen, et al.

This work targets automated design and scaling of Vision Transformers (ViTs). The motivation comes from two pain points: 1) the lack of efficient and principled methods for designing and scaling ViTs; 2) the tremendous computational cost of training ViTs, which is much heavier than for their convolutional counterparts. To tackle these issues, we propose As-ViT, an auto-scaling framework for ViTs without training, which automatically discovers and scales up ViTs in an efficient and principled manner. Specifically, we first design a "seed" ViT topology by leveraging a training-free search process. This extremely fast search is fulfilled by a comprehensive study of ViT's network complexity, yielding a strong Kendall-tau correlation with ground-truth accuracies. Second, starting from the "seed" topology, we automate the scaling rule for ViTs by growing widths/depths of different ViT layers. This results in a series of architectures with different numbers of parameters in a single run. Finally, based on the observation that ViTs can tolerate coarse tokenization in early training stages, we propose a progressive tokenization strategy to train ViTs faster and cheaper. As a unified framework, As-ViT achieves strong performance on classification (83.5% top-1 on ImageNet-1K) and detection (52.7% mAP on COCO) without any manual crafting or scaling of ViT architectures: the end-to-end model design and scaling process costs only 12 hours on one V100 GPU. Our code is available at
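To illustrate the first step, here is a minimal pure-Python sketch of how a training-free search could be validated: candidate topologies are ranked by a complexity-based proxy score, the Kendall-tau correlation against ground-truth accuracies checks whether the proxy ranks architectures faithfully, and the top-scoring candidate becomes the "seed" topology. The proxy scores and accuracies below are made-up placeholders, not numbers from the paper.

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall rank correlation between two score lists (assumes no ties)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical candidates: training-free proxy score vs. fully-trained accuracy.
proxy_scores = [0.31, 0.58, 0.12, 0.77, 0.45]
true_accs = [72.1, 79.4, 68.0, 81.2, 76.5]

# A tau near 1.0 means the proxy preserves the true accuracy ranking,
# so it can replace training during the architecture search.
tau = kendall_tau(proxy_scores, true_accs)

# The "seed" topology is simply the candidate with the best proxy score.
seed_index = max(range(len(proxy_scores)), key=proxy_scores.__getitem__)
```

The key design point is that `kendall_tau` only compares *rankings*, so the proxy's absolute values need not approximate accuracy at all; it only has to order candidates consistently.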



