Centroid-centered Modeling for Efficient Vision Transformer Pre-training

03/08/2023
by   Xin Yan, et al.

Masked Image Modeling (MIM) is a self-supervised vision pre-training paradigm for Vision Transformers (ViT). Previous works are either pixel-based or token-based, reconstructing original pixels or discrete visual tokens produced by a parametric tokenizer model, respectively. Our proposed approach, CCViT, uses k-means clustering to obtain centroids for image modeling, without supervised training of a tokenizer model. The centroids simultaneously represent patch pixels and serve as index tokens, and have the property of local invariance. The non-parametric centroid tokenizer takes only seconds to create and is faster at token inference. Specifically, we adopt patch masking and centroid replacement strategies to construct corrupted inputs, and two stacked encoder blocks to predict the tokens of corrupted patches and reconstruct the original patch pixels. Experiments show that a ViT-B model pre-trained for only 300 epochs achieves 84.3% top-1 accuracy on ImageNet-1K classification and 51.6% mIoU on ADE20K semantic segmentation. Our approach achieves results competitive with BEiTv2 without requiring distillation from other models, and outperforms other methods such as MAE.
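The centroid tokenizer described above can be sketched in a few lines: fit k-means over flattened patch vectors, then tokenize each patch by its nearest centroid index. This is a minimal illustration, not the paper's implementation; the patch size, number of centroids, and iteration count below are illustrative assumptions.

```python
import numpy as np

def fit_centroids(patches, k=8, iters=10, seed=0):
    """Plain k-means over flattened patch vectors (a sketch of a
    non-parametric centroid tokenizer; hyperparameters are illustrative)."""
    rng = np.random.default_rng(seed)
    # initialize centroids from k randomly chosen patches
    centroids = patches[rng.choice(len(patches), size=k, replace=False)]
    for _ in range(iters):
        # assign each patch to its nearest centroid (Euclidean distance)
        d = np.linalg.norm(patches[:, None, :] - centroids[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # move each centroid to the mean of its assigned patches
        for j in range(k):
            members = patches[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids

def tokenize(patches, centroids):
    """Map each patch to the index of its nearest centroid."""
    d = np.linalg.norm(patches[:, None, :] - centroids[None, :, :], axis=-1)
    return d.argmin(axis=1)

# toy demo: 256 random 16x16x3 patches flattened to vectors
patches = np.random.default_rng(1).random((256, 16 * 16 * 3)).astype(np.float32)
centroids = fit_centroids(patches, k=8)
tokens = tokenize(patches, centroids)
print(tokens.shape)
```

Because the same centroids double as both pixel targets and discrete token indices, masked patches can be supervised in token space while reconstruction is supervised in pixel space, with no learned tokenizer in the loop.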


Related research:

- BEiT: BERT Pre-Training of Image Transformers (06/15/2021)
- MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning (05/27/2022)
- mc-BEiT: Multi-choice Discretization for Image BERT Pre-training (03/29/2022)
- SeiT: Storage-Efficient Vision Training with Tokens Using 1% of Pixel Storage (03/20/2023)
- Attentive Mask CLIP (12/16/2022)
- BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers (08/12/2022)
- BUS: Efficient and Effective Vision-language Pre-training with Bottom-Up Patch Summarization (07/17/2023)
