Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

by   Li Yuan, et al.

Transformers, which are popular for language modeling, have been explored for solving vision tasks recently, e.g., the Vision Transformers (ViT) for image classification. The ViT model splits each image into a sequence of tokens with fixed length and then applies multiple Transformer layers to model their global relation for classification. However, ViT achieves inferior performance compared with CNNs when trained from scratch on a midsize dataset (e.g., ImageNet). We find it is because: 1) the simple tokenization of input images fails to model the important local structure (e.g., edges, lines) among neighboring pixels, leading to its low training sample efficiency; 2) the redundant attention backbone design of ViT leads to limited feature richness in fixed computation budgets and limited training samples. To overcome such limitations, we propose a new Tokens-To-Token Vision Transformers (T2T-ViT), which introduces 1) a layer-wise Tokens-to-Token (T2T) transformation to progressively structurize the image to tokens by recursively aggregating neighboring Tokens into one Token (Tokens-to-Token), such that local structure presented by surrounding tokens can be modeled and tokens length can be reduced; 2) an efficient backbone with a deep-narrow structure for vision transformers motivated by CNN architecture design after extensive study. Notably, T2T-ViT reduces the parameter counts and MACs of vanilla ViT by 200%, while achieving more than 2.5% improvement when trained from scratch on ImageNet. It also outperforms ResNets and achieves comparable performance with MobileNets when directly training on ImageNet. For example, T2T-ViT with ResNet50 comparable size can achieve 80.7% top-1 accuracy on ImageNet. (Code:


DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification

Attention is sparse in vision transformers. We observe the final predict...

CageViT: Convolutional Activation Guided Efficient Vision Transformer

Recently, Transformers have emerged as the go-to architecture for both v...

Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer

Vision transformers have recently received explosive popularity, but the...

Vision Transformer with Progressive Sampling

Transformers with powerful global relation modeling abilities have been ...

Scalable Visual Transformers with Hierarchical Pooling

The recently proposed Visual image Transformers (ViT) with pure attentio...

So-ViT: Mind Visual Tokens for Vision Transformer

Recently the vision transformer (ViT) architecture, where the backbone p...

SeiT: Storage-Efficient Vision Training with Tokens Using 1 Storage

We need billion-scale images to achieve more generalizable and ground-br...

Please sign up or login with your details

Forgot password? Click here to reset