Improving Vision Transformers for Incremental Learning

12/12/2021
by Pei Yu, et al.

This paper studies the use of Vision Transformers (ViT) in class-incremental learning. Surprisingly, naively replacing convolutional neural networks (CNNs) with ViT degrades performance. Our analysis reveals three issues with the naive use of ViT: (a) ViT converges very slowly when the number of classes is small, (b) ViT shows more bias towards new classes than CNN-based models, and (c) the learning rate that suits ViT is too low to learn a good classifier. Based on this analysis, we show that these issues can be addressed simply with existing techniques: a convolutional stem, balanced finetuning to correct bias, and a higher learning rate for the classifier. Our simple solution, named ViTIL (ViT for Incremental Learning), achieves the new state-of-the-art on all three class-incremental learning setups by a clear margin, providing a strong baseline for the research community. For instance, on ImageNet-1000, our ViTIL achieves 69.20% [...] 500 initial classes with 5 incremental steps (100 new classes for each), outperforming LUCIR+DDE by 1.69% [...] incremental steps (100 new classes), our method outperforms PODNet by 7.27% (65.13% [...]
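
As a rough illustration of the balanced finetuning idea mentioned in the abstract, the sketch below builds a class-balanced subset from a small exemplar memory of old classes and a much larger pool of new-class data. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say how the balanced set is built, so subsampling every class down to the size of the smallest class is an assumption, and `balanced_subset`, the toy sample names, and the class sizes are all hypothetical.

```python
import random
from collections import defaultdict


def balanced_subset(samples, per_class=None, seed=0):
    """Return a class-balanced subset of (item, label) pairs.

    Assumption: "balanced" means every class contributes the same number
    of samples. If per_class is None, each class is reduced to the size
    of the smallest class.
    """
    by_class = defaultdict(list)
    for item, label in samples:
        by_class[label].append(item)

    if per_class is None:
        per_class = min(len(items) for items in by_class.values())

    rng = random.Random(seed)
    subset = []
    for label in sorted(by_class):
        items = by_class[label]
        # Sample without replacement; never ask for more than is available.
        chosen = rng.sample(items, min(per_class, len(items)))
        subset.extend((item, label) for item in chosen)

    rng.shuffle(subset)
    return subset


# Toy data (hypothetical): 2 old classes with 20 stored exemplars each,
# and 2 new classes with 500 samples each.
old = [(f"old{c}_{i}", c) for c in range(2) for i in range(20)]
new = [(f"new{c}_{i}", c) for c in range(2, 4) for i in range(500)]

balanced = balanced_subset(old + new)

counts = defaultdict(int)
for _, label in balanced:
    counts[label] += 1
print(dict(sorted(counts.items())))  # {0: 20, 1: 20, 2: 20, 3: 20}
```

In this toy setup the unbalanced training set holds 25 times more samples per new class than per old class. The balanced subset removes that skew, which is the kind of correction a finetuning pass on balanced data is meant to provide.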
