How Do Vision Transformers Work?

02/14/2022
by Namuk Park, et al.

The success of multi-head self-attentions (MSAs) for computer vision is now indisputable. However, little is known about how MSAs work. We present fundamental explanations to help better understand the nature of MSAs. In particular, we demonstrate the following properties of MSAs and Vision Transformers (ViTs): (1) MSAs improve not only accuracy but also generalization by flattening the loss landscapes. Such improvement is primarily attributable to their data specificity, not long-range dependency. On the other hand, ViTs suffer from non-convex losses. Large datasets and loss landscape smoothing methods alleviate this problem; (2) MSAs and convolutions (Convs) exhibit opposite behaviors. For example, MSAs are low-pass filters, but Convs are high-pass filters. Therefore, MSAs and Convs are complementary; (3) Multi-stage neural networks behave like a series connection of small individual models. In addition, MSAs at the end of a stage play a key role in prediction. Based on these insights, we propose AlterNet, a model in which Conv blocks at the end of a stage are replaced with MSA blocks. AlterNet outperforms CNNs not only in large data regimes but also in small data regimes. The code is available at https://github.com/xxxnell/how-do-vits-work.
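The central design change described above, replacing the Conv block at the end of each stage with an MSA block so that the low-pass MSA aggregates what the high-pass Convs produce, can be sketched in a few lines of PyTorch. The sketch below is illustrative only and is not the authors' implementation (see the linked repository for the official code); the block definitions, channel width, depth, and head count are assumptions made for the example.

```python
# Minimal sketch of the AlterNet idea: a stage of residual conv blocks whose
# final block is replaced by a multi-head self-attention (MSA) block.
# All class names and hyperparameters here are illustrative assumptions,
# not the official implementation.

import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """Pre-activation residual conv block (high-pass-like behavior)."""

    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        return x + self.body(x)


class MSABlock(nn.Module):
    """Residual multi-head self-attention over flattened spatial positions
    (low-pass-like, spatially smoothing behavior)."""

    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)   # (B, H*W, C) spatial tokens
        t = self.norm(tokens)
        out, _ = self.attn(t, t, t)             # data-specific aggregation
        tokens = tokens + out                   # residual connection
        return tokens.transpose(1, 2).reshape(b, c, h, w)


def alternet_stage(channels: int, depth: int) -> nn.Sequential:
    """One stage: (depth - 1) conv blocks followed by an MSA block, i.e.
    the Conv block at the end of the stage is replaced with an MSA block."""
    blocks = [ConvBlock(channels) for _ in range(depth - 1)]
    blocks.append(MSABlock(channels))
    return nn.Sequential(*blocks)


if __name__ == "__main__":
    stage = alternet_stage(channels=64, depth=3)
    x = torch.randn(2, 64, 16, 16)
    print(stage(x).shape)  # torch.Size([2, 64, 16, 16])
```

Stacking several such stages (with downsampling between them) would give an AlterNet-style network; the point of the sketch is only the ordering of blocks within a stage, which follows the paper's observation that MSAs at the end of a stage play a key role in prediction.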

