Demystify Self-Attention in Vision Transformers from a Semantic Perspective: Analysis and Application

11/13/2022
by Leijie Wu, et al.

Self-attention mechanisms, especially multi-head self-attention (MSA), have achieved great success in many fields such as computer vision and natural language processing. However, many existing vision transformer (ViT) works simply inherit transformer designs from NLP to adapt them to vision tasks, ignoring the fundamental difference between how MSA works in image and in language settings. Language naturally contains highly semantic structures that are directly interpretable by humans: its basic unit (the word) is discrete and carries little redundant information, which readily supports interpretability studies of the MSA mechanism in language transformers. In contrast, visual data exhibits a fundamentally different structure: its basic unit (the pixel) is a low-level representation with significant redundancy in its neighbourhood, which poses obvious challenges to interpreting the MSA mechanism in ViT. In this paper, we introduce a classic image processing technique, the scale-invariant feature transform (SIFT), which maps low-level representations into a mid-level space and annotates discrete keypoints with semantically rich information. We then construct a weighted patch interrelation analysis based on SIFT keypoints to capture the attention patterns hidden in patches with different semantic concentrations. Interestingly, we find that this quantitative analysis is not only an effective complement to the interpretability of MSA mechanisms in ViT, but can also be applied to 1) spurious correlation discovery and "prompting" during model inference, and 2) guided acceleration of model pre-training. Experimental results on both applications show significant advantages over baselines, demonstrating the efficacy of our method.

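To make the core idea concrete, the sketch below shows one plausible way to turn SIFT keypoints into per-patch "semantic concentration" weights on a ViT patch grid. It is an illustrative reconstruction, not the authors' code: the function name, the use of kp.response as a weight, and the normalisation scheme are all our assumptions. It relies only on standard OpenCV (cv2.SIFT_create, detect) and NumPy.

    # Illustrative sketch: per-patch semantic weights from SIFT keypoints.
    # Names and the weighting scheme are assumptions, not the paper's code.
    import cv2
    import numpy as np

    def patch_semantic_weights(gray: np.ndarray, patch_size: int = 16) -> np.ndarray:
        """Accumulate SIFT keypoint responses into each ViT patch cell.

        Returns an (H // patch_size, W // patch_size) array normalised to
        sum to 1, usable as a weighting mask over ViT patch tokens.
        """
        sift = cv2.SIFT_create()
        keypoints = sift.detect(gray, None)

        h, w = gray.shape
        grid = np.zeros((h // patch_size, w // patch_size), dtype=np.float64)
        for kp in keypoints:
            x, y = kp.pt  # keypoint location in pixel coordinates
            row, col = int(y) // patch_size, int(x) // patch_size
            if row < grid.shape[0] and col < grid.shape[1]:
                # kp.response measures keypoint strength; using it as the
                # patch weight is our assumption, not necessarily the paper's.
                grid[row, col] += kp.response
        total = grid.sum()
        return grid / total if total > 0 else grid

    # Toy usage: a synthetic image with a bright rectangle, whose corners
    # should give SIFT something to detect.
    img = np.zeros((224, 224), dtype=np.uint8)
    cv2.rectangle(img, (64, 64), (160, 160), 255, thickness=-1)
    weights = patch_semantic_weights(img, patch_size=16)
    print(weights.shape)  # (14, 14) patch grid for a 224x224 image
    print(np.unravel_index(weights.argmax(), weights.shape))  # densest patch

A natural follow-up, in the spirit of the paper's weighted patch interrelation analysis, is to correlate such a weight grid with a ViT head's attention map to see whether attention concentrates on semantically dense patches; the exact weighting and comparison scheme used in the paper may differ from this sketch.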
