1 Explainability Through Attention Maps
Generalizing beyond single-task solutions using large-scale transformer-based language models has gained increasing attention from the community. In particular, the switch to open-vocabulary predictions promises AI systems capable of adapting beyond previously seen training objectives. Arguably, transformers are the state-of-the-art method in Natural Language Processing (NLP) and Computer Vision (CV). Most recently, they demonstrated remarkable performance on multi-modal tasks, e.g., bridging CV capabilities with text understanding to solve Visual Question Answering (VQA) scenarios [9, 17, 29, 28]. The increasing adoption of transformers, however, also raises the necessity to better understand the reasons behind their otherwise black-box predictions. Unfortunately, the "scale is all you need" assumption of transformers results in severely large and complex architectures, making their training, inference deployment, and understanding a resource-intensive task that requires multiple enterprise-grade GPUs or even entire computing nodes, along with prolonged runtimes.
Most, if not all, explainable AI (XAI) methods for transformers, i.e., methods that make the decision-making processes and internal workings of AI models transparent and understandable to humans, work by propagating (some form of) gradients back through the model. This backpropagation allows the accumulation of information about how each input feature contributes to the output tokens [6, 1], utilizing the activations stored during the forward pass. Unfortunately, this leads to a significant memory overhead, which renders their production deployment uneconomical, if not impossible. Often, half of the available GPU memory has to remain free during inference, or an entirely separate deployment of the XAI pipeline is required.
Fortunately, another popular XAI idea, namely perturbation [18, 23], is much more memory-efficient. However, it has not proven beneficial for explaining the predictions of transformers so far, most likely because of the immense number of required forward trials, which accumulates unreasonable computation time.
To tackle these issues and, in turn, scale explanations with the size of transformers, we propose to bridge relevance propagation and perturbation. In contrast to existing perturbation methods, which execute perturbations directly in the input space, we push them into the latent space, allowing, as we will show, state interpolation and token-based similarity measures. Specifically, inspired by [10] and backpropagation approaches, we introduce attention manipulations throughout the latent layers of the transformer during the forward pass as a method to steer model predictions. Our explanation method, called AtMan, then leverages these predictions to compute relevance values for transformer networks. Our experimental evidence demonstrates that
AtMan significantly reduces the number of required perturbations, making them applicable at deployment time, and does not require additional memory compared to plain forward passes. In short, AtMan can scale with transformers. Our exhaustive experiments on text and image-text benchmarks also demonstrate that AtMan outperforms the current gradient-based state of the art while being computationally efficient. In fact, for the first time, AtMan allows one to study generative model predictions as visualized in Fig. 1. During sequence generation with large multi-modal models, AtMan is able to additionally highlight relevant features with respect to the input, providing novel insights into the generation process.

Contributions.
In summary, our contributions are: (i) an examination of the effects of token-based attention score manipulation on generative transformer models; (ii) the introduction of a novel and memory-efficient XAI perturbation method for large-scale transformer models, called AtMan, which reduces the number of required iterations to a tractable amount by correlating tokens in the embedding space; and (iii) exhaustive multi-modal evaluations of XAI methods on several text and image-text benchmarks and autoregressive (AR) transformers. We release the source code of the proposed method and all evaluation scripts at https://github.com/Mayukhdeb/atman-magma.
We proceed as follows. We start off by discussing related work. Then, we derive AtMan and explain its attention manipulation as a perturbation technique. Before concluding and discussing the benefits as well as limitations, we touch upon our experimental evaluation, showing that AtMan not only nullifies memory overhead but also outperforms competitors on several visual and textual reasoning benchmarks.
2 Related Work
Explainability in CV and NLP.
Explainability of AI systems is a still ambiguously defined term [7]. XAI methods are expected to show some level of relevance of the input with respect to the computed result of an algorithm. This task is usually tackled by constructing an input relevance map given the model's prediction. The nature of relevance can be class-specific, e.g., depending on specific target instances of a task and showing a local solution [25, 26], or class-agnostic, i.e., depending only on the global behavior of the model [1, 3]. The granularity of the achieved explanation therefore depends on the chosen method, the model, and the actual evaluation benchmark.
Explainability in CV is usually evaluated by mapping the relevance maps to the pixel level and regarding the evaluation as a weak segmentation task [24, 19, 26]. NLP explanations, on the other hand, are much more vaguely defined and usually mixed with more complex philosophical interpretations, such as labeling a given text with a certain sentiment category [7].
The majority of XAI methods can be divided into the classes of perturbation and gradient analysis. Perturbation methods treat the model as a black box and attempt to derive knowledge of the model's behavior by studying changes in input-output pairs only. Gradient-based methods, on the other hand, execute a backpropagation step towards a target and aggregate the resulting gradient information to derive insights.
Most of these XAI methods are not motivated by a specific discipline, i.e., neither by NLP nor by CV. They are generic enough to be applied to both disciplines, at least to some extent. However, architecture-specific XAI methods also exist, such as GradCAM [24], which leverages the spatial input aggregation of convolutional neural networks in their deepest layers to increase efficiency.
Explainability in Transformers.
Due to their increasing size, transformers are particularly challenging for explainability methods, especially architecture-agnostic ones. Transformers' core components include an embedding layer followed by multiple layers of alternating attention and feed-forward blocks. The attention blocks map the input into separate "query", "key", and "value" matrices and are split into an array of "heads". As with convolutions in CNNs, separate heads are believed to relate to specific learned features or tasks [12]. Further, the attention matrix dimensions match the input sequence dimension, which makes the attention mechanism particularly well suited for deriving input explanations.
Consequently, most explainability adaptations to transformers focus on the attention mechanism. Attention rollout [1] assumes that activations in attention layers are combined linearly and considers paths along the pairwise attention graph. However, while being efficient, it often emphasizes irrelevant tokens, in particular due to its class-agnostic nature. Therefore, the same authors also propose attention flow [1], which is unfeasible to use due to the high computational demands of constructing its graphs. More recently, Chefer et al. [6] proposed to aggregate backward gradients and LRP [19] throughout all layers and heads of the attention modules in order to derive explanation relevancy. Their method outperforms previous transformer-specific and unspecific XAI methods on several benchmarks and transformer models. It has been extended to multimodal transformers [5] by studying other variations of attention. However, the evaluated benchmarks only include classification tasks, despite transformers' remarkable performance on open-vocabulary tasks, e.g., utilizing InstructGPT [20] or multimodal autoregressive transformers such as MAGMA [9], BLIP [17], and OFA [29].
Multimodal Transformers.
In contrast to these explainability studies, which evaluate on models like DETR and ViT [4, 8], we study explainability on the generated text tokens of a language model rather than on specifically trained classifiers. Due to the multimodality, the XAI method should produce output relevancy either on the input text or on the input image, as depicted in Fig. 1. To this end, we study the explainability of multimodal transformer architectures such as MAGMA [9]; an open-source version can be found at https://github.com/aleph-alpha/magma.
3 AtMan: Attention Manipulation
We formulate finding the best explainability estimator of a model as answering the following question: What is the most important part of the input, as annotated by the explanator, for producing the model's output? In the following, we derive our perturbation probe mathematically through studies of influence functions and embedding layer updates on autoregressive (AR) models [13, 2]. We then show how attention manipulation on single tokens can be used in NLP tasks to steer the prediction of a model in directions found within the prompt. Finally, we derive our multi-modal XAI method AtMan by extending this concept to the cosine neighborhood in the embedding space.
3.1 Influence Functions as Explainability Estimators
Transformer-based language models are probability distribution estimators. They map from some input space $\mathcal{X}$ (e.g., text or image embeddings) to an output space $\mathcal{Y}$ (e.g., language token probabilities). Let $\mathcal{E}$ be the space of all explanations (i.e., binary labels) over $\mathcal{X}$. An explanator function can then be defined as $E\colon (\mathcal{X} \to \mathcal{Y}) \times \mathcal{X} \times \mathcal{Y} \to \mathcal{E}$, i.e., given a model, an input, and a target, derive a label on the input.

[Figure 2 (caption excerpt): ... transformer blocks to produce output logits, e.g., taken for the next-token prediction in the setting of a generative language model. The middle shows in detail a masked attention block, consisting of MatMul, Mask, and SoftMax steps. The right-hand side shows our proposed Attention Manipulation method: we multiply the modifier factors with the attention scores before applying the diagonal causal attention mask.]
Given a sequence $\mathbf{w} = (w_1, \ldots, w_n)$, an AR language model assigns a probability to that sequence by applying the factorization $p(\mathbf{w}) = \prod_i p(w_i \mid w_{<i})$. The loss optimization during training can then be formalized as solving:

$$\max_\theta \; \prod_i p_\theta(w_i \mid E\,w_{<i}) \tag{1}$$

$$= \; \min_\theta \; \sum_i -\log p_\theta(w_i \mid E\,w_{<i}) \;=\; \min_\theta \; \mathcal{L}_\theta(\mathbf{w}). \tag{2}$$

Here $\theta$ denotes the model, $E$ the learned embedding matrix, and $w_i$ the vocabulary index of the target of length $n$. Eq. 1 is turned into Eq. 2 by inserting the cross-entropy loss commonly used during language model training. Finally, $\mathcal{L}$ denotes our loss function.
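For concreteness, the following sketch computes the summed target cross-entropy $\mathcal{L}$ of a continuation under a causal language model. It assumes a HuggingFace-style `model`/`tokenizer` interface; the function name `target_cross_entropy` and this interface are illustrative choices, not part of the released AtMan code.

```python
import torch
import torch.nn.functional as F

def target_cross_entropy(model, tokenizer, prompt: str, target: str) -> float:
    """Summed cross-entropy of the target tokens given the prompt (cf. Eq. 2)."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(target, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)

    with torch.no_grad():
        logits = model(input_ids).logits            # (1, seq_len, vocab_size)

    # The logits at position t predict token t+1, so take the positions
    # directly preceding each target token.
    n_target = target_ids.shape[1]
    pred = logits[0, -n_target - 1:-1, :]           # (n_target, vocab_size)
    return F.cross_entropy(pred, target_ids[0], reduction="sum").item()
```

The same helper is reused in the sketches below whenever a target cross-entropy is needed.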
Perturbation methods study the influence on the model's predictions by adding small amounts of noise to the input and measuring the change of the prediction. We follow the results of the studies [13, 2] to approximate the perturbation effect directly through the model's parameters when executing leave-one-out experiments on the input. The influence function estimating the perturbation of input token $i$ is then derived as:

$$\mathcal{I}(i, \mathbf{w}) \;=\; \mathcal{L}_{\tilde{\theta}_{-i}}(\mathbf{w}) \;-\; \mathcal{L}_{\theta}(\mathbf{w}). \tag{3}$$

Here $\tilde{\theta}_{-i}$ denotes the set of model parameters in which token $i$ would not have been seen during training. In the following, we further show how to approximate $\tilde{\theta}_{-i}$.
3.2 Single Token Attention Manipulation
The core idea of AtMan is the shift of the perturbation space from the raw input space to the embedded token space. This allows us to reduce the dimensionality of possible perturbations down to a single scaling factor per token. Moreover, we do not manipulate the value matrix of attention blocks and therewith do not introduce the otherwise inherent input-distribution shift of obfuscation methods. By manipulating the attention entries at the positions of the corresponding input sequence tokens, we are able to interpolate the focus of the prediction distribution of the model—amplifying or suppressing concepts of the prompt. The following shows that this procedure indeed derives a well-performing XAI method.
Attention was introduced in [27] as $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\big(Q K^T / \sqrt{d_h}\big) \cdot V$, where $\cdot$ denotes matrix multiplication. The pre-softmax query-key attention scores are defined as $H = Q K^T / \sqrt{d_h}$. In the case of autoregression, a lower-left triangular unit mask $M$ is applied to these scores as $H \odot M$, with $\odot$ the Hadamard product. The output of the self-attention module is $O = \mathrm{softmax}(H \odot M) \cdot V$, the query matrix is $Q \in \mathbb{R}^{h \times n \times d_h}$, and $K, V$ are the key and value matrices. The number of heads is denoted by $h$, and $d$ is the embedding dimension of the model, with $d_h = d / h$. Finally, there are $n$ query and key tokens, which here coincide with the number of input-sequence tokens.
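As a point of reference, here is a minimal single-head sketch of the scaled dot-product attention and causal mask described above. It uses the common `masked_fill` idiom rather than the Hadamard-product notation of the text, and is not taken from the released implementation.

```python
import math
import torch

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal (lower-triangular) mask.

    Q, K, V: tensors of shape (seq_len, d_head).
    """
    seq_len, d_head = Q.shape
    scores = Q @ K.T / math.sqrt(d_head)                       # pre-softmax scores H
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    scores = scores.masked_fill(~mask, float("-inf"))          # causal masking
    return torch.softmax(scores, dim=-1) @ V
```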


The perturbation approximation required by Sec. 3.1 can now be obtained through attention score manipulation as follows. Let $\mathbf{w}$ be an input token sequence of length $n$. Let $i$ be a token index within this sequence to be perturbed by a factor $f$. For all layers $l$ and all heads $h$, we modify the pre-softmax query-key attention scores as:

$$\tilde{H}^{l,h} \;=\; H^{l,h} \odot (\mathbb{1} - F_i), \qquad \forall\, l, h, \tag{4}$$

where $\mathbb{1}$ denotes the matrix containing only ones and $F_i$ the suppression factor matrix for token $i$. In this section, we set $(F_i)_{u,v} = f$ for $v = i$ and $(F_i)_{u,v} = 0$ elsewhere. As depicted in Fig. 2, we thus only scale the $i$-th column of the attention scores by a factor $1 - f$, equally for all heads. (We follow the common assumption that all relevant entropy of an input token is processed primarily at that position within the attention module due to the sequence-to-sequence nature of the transformer. A different variant of this approach is discussed in Appendix A.5.) Let us denote this modification to the model by $\tilde{\theta}_{-i}$ and assume a fixed factor $f$; we ran a parameter sweep once to fix this parameter.
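A minimal sketch of the manipulation in Eq. 4: the score column of the token to be suppressed is scaled by $1 - f$ in every head, before masking and softmax. The tensor layout is an assumption; the released implementation may hook into the attention modules differently.

```python
import torch

def suppress_token_scores(scores: torch.Tensor, token_idx: int, f: float = 0.9) -> torch.Tensor:
    """Apply Eq. 4: scale the attention-score column of one token by (1 - f).

    scores: pre-softmax query-key scores of shape (n_heads, seq_len, seq_len),
            modified identically in every layer and head.
    f:      f in (0, 1) suppresses the token, f < 0 amplifies it.
    """
    factors = torch.ones_like(scores)
    factors[:, :, token_idx] = 1.0 - f
    return scores * factors
```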
For a target $\mathbf{y}$, e.g., a class label, we define the explanation as the vector of influence functions over all input positions:

$$E(\mathbf{w}, \mathbf{y}) \;=\; \big(\mathcal{I}(1, \mathbf{y}), \ldots, \mathcal{I}(n, \mathbf{y})\big), \qquad \mathcal{I}(i, \mathbf{y}) \;=\; \mathcal{L}_{\tilde{\theta}_{-i}}(\mathbf{y}) - \mathcal{L}_{\theta}(\mathbf{y}). \tag{5}$$

In words, we average the cross-entropy of the AR input sequence with respect to all target tokens and measure the change when suppressing token index $i$, compared to the unmodified model. The explanation is this difference vector over all possible sequence position perturbations and thus requires $n$ iterations.

Fig. 3 illustrates this algorithm. The original input prompt is the text "Ben likes to eat burgers and ", for which we want to extract the most valuable token for the completion and target token "fries". Initially, the model predicts the target token with a certain cross-entropy score. We now iterate through the input tokens, suppressing them one by one, and track the changes in the cross-entropy of the target token, as depicted in the right-most column. In this example, it can be observed that "burgers" was the most influential input token for completing the sentence with "fries", as it yields the highest score.
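Putting the pieces together, the per-token loop of Fig. 3 can be sketched as follows. It reuses `target_cross_entropy` from Sec. 3.1; `model.suppress_token` is a hypothetical context manager that applies Eq. 4 for the given token index during the forward pass, not the released API.

```python
def atman_explain(model, tokenizer, prompt: str, target: str, f: float = 0.9):
    """Relevance of each prompt token: increase of the target cross-entropy
    when that token is suppressed (cf. Eq. 5)."""
    base_loss = target_cross_entropy(model, tokenizer, prompt, target)
    n_prompt = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]

    relevances = []
    for i in range(n_prompt):
        # Hypothetical hook applying Eq. 4 to token i in all layers and heads.
        with model.suppress_token(index=i, factor=f):
            loss_i = target_cross_entropy(model, tokenizer, prompt, target)
        relevances.append(loss_i - base_loss)   # positive: suppressing token i hurts the target
    return relevances
```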
Next, we give a more descriptive intuition about the effects of such modifications on the model’s generative nature.
Token attention suppression steers the model’s prediction.
Intuitively, for factors $f \in (0, 1)$, we call the modifications "suppression", as we find the model's output now relatively less influenced by the token at the position of the respective manipulated attention scores. Contrarily, $f < 0$ "amplifies" the influence of the manipulated input token on the output.
An example of the varying continuations when a single token manipulation is applied can be seen in Fig. 4. We provide the model a prompt in which the focus of the continuation largely depends on two tokens, namely "soccer" and "math". We show how suppressing and amplifying them alters the prediction distributions away from or towards those concepts. It is precisely this distribution shift that we measure and visualize as our explanation.

3.3 Correlated Token Attention Manipulation
Suppressing single tokens works well when the entire entropy responsible for producing the target token occurs only once in the input. However, for inputs with redundant information, this approach often fails. This issue is particularly prominent in CV, as information, e.g., about objects in an image, is often spread across several embeddings due to the splitting of the image into parts and the separate application of the embedding function. It is a common finding that cosine similarity in the embedding space, e.g., right after the embedding layer, gives a good estimator of correlation distance [16, 2]. We integrate this finding into AtMan in order to suppress all redundant information corresponding to a particular input token at once, which we refer to as correlated token suppression.
Fig. 5 summarizes the correlated token suppression visually. For $n$ input tokens and embedding dimension $d$, the embedded tokens form a matrix $T \in \mathbb{R}^{n \times d}$. The cosine similarity, in turn, is computed from the normalized embeddings $\hat{t}_i = t_i / \lVert t_i \rVert$, for $i = 1, \ldots, n$, as $S = \hat{T} \hat{T}^T \in \mathbb{R}^{n \times n}$. Note that the index $i$ denotes a column corresponding to the respective input token index. Intuitively, the vector $s_i = S_{i,\ast}$ then contains similarity scores to all (other) input tokens. To suppress the correlated neighborhood of a specific token with index $i$, we therefore adjust the suppression factor matrix for Eq. 4 as
$$(F_i)_{u,v} \;=\; \min\big(\max(f \cdot s_{i,v},\; 0),\; 1 - \kappa\big), \qquad \text{for all } u, v. \tag{6}$$
As we only want to suppress tokens, we restrict the range of the suppression values to be greater than zero. The parameter $\kappa$ ensures a lower bound on the resulting scaling factor and, in particular, prevents a sign flip. We empirically fixed $\kappa$ through a parameter sweep (Appendix A.4).
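One plausible reading of Eq. 6 as code is sketched below, assuming access to the token embeddings right after the embedding layer. The exact clamping follows the textual description, not the released implementation, and both `f` and `kappa` are left as arguments to be set by the parameter sweeps mentioned above.

```python
import torch
import torch.nn.functional as F

def correlated_suppression_factors(embeddings: torch.Tensor, token_idx: int,
                                   f: float, kappa: float) -> torch.Tensor:
    """Per-column suppression strengths for token `token_idx` and its cosine neighborhood.

    embeddings: (seq_len, d_model) outputs of the embedding layer.
    Returns strengths in [0, 1 - kappa]; column j of the attention scores is then
    multiplied by 1 - strength[j], so the applied factor never drops below kappa.
    """
    normed = F.normalize(embeddings, dim=-1)
    similarity = normed @ normed[token_idx]              # cosine similarity to token i, (seq_len,)
    strength = torch.clamp(f * similarity, min=0.0, max=1.0 - kappa)
    return strength
```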
With that, we arrive at our final version of AtMan. As a final remark, note that this form of explanation is local, as the target $\mathbf{y}$ refers to our target class. We can, however, straightforwardly derive a global explanation by setting $\mathbf{y}$ to a model completion of the input $\mathbf{w}$ of a certain length. It could then be interpreted, rather abstractly, as the model's general focus [3].

4 Empirical Evaluation
We ran empirical evaluations on text and image corpora to address the following questions: (Q1) Does AtMan achieve competitive results compared to previous XAI methods for transformers in the language as well as the vision domain? (Q2) Does AtMan scale efficiently and can it, therefore, be applied to current large-scale AR models?
To answer these questions, we conducted empirical studies on textual and visual XAI benchmarks and compared AtMan to standard approaches such as IxG [25], IG [26], and GradCAM [24], as well as to the transformer-specific XAI method of [6], called Chefer in the following. Note that all these methods utilize gradients and therefore count as propagation methods, leading to memory inefficiency. We also applied existing perturbation methods such as LIME [23] and SHAP [18]. However, they failed due to the extremely large number of required trials and, in turn, prohibitive computation time. We adopt common metrics, namely mean average precision (mAP) and mean average recall (mAR), and state their interquartile statistics in all experiments. While AtMan's memory efficiency allows it to be utilized on larger models, to provide a comparison between XAI methods we ran the corresponding experiments on MAGMA-6B (available at https://github.com/aleph-alpha/magma) if not stated otherwise.
4.1 AtMan can do Language reasoning
Protocol.
Since with AtMan we aim to study large-scale generative models, we formulate XAI on generative tasks as described in Sec. 3.3. To this end, we used the Stanford Question Answering (QA) Dataset (SQuAD) [22]. The QA dataset is structured as follows: Given a single paragraph of information, there are multiple questions, each with a corresponding answer referring to a position in the paragraph. A visualization of an instance of this dataset can be found in Fig. 6. In total, SQuAD contains 536 unique paragraphs and 107,785 question/explanation pairs. The average context sequence length is tokens, and the average label (explanation) length is .

| | IxG | IG | Chefer | AtMan |
|---|---|---|---|---|
| mAP | 51.7 | 49.5 | 72.7 | 73.7 |
| mAP (IQ) | 61.4 | 49.5 | 77.5 | 81.8 |
| mAR | 91.8 | 87.1 | 96.6 | 93.4 |
| mAR (IQ) | 100 | 98.6 | 100 | 100 |

Table 1: XAI performance on SQuAD (mean average precision and recall; IQ denotes the interquartile statistic).
The model was prompted with the template "{Context} Q: {Question} A:", and the explainability methods were executed to derive scores for the tokens inside the given context, cf. Fig. 6. If there were multiple tokens in the target label, we computed the average of the scores over the target tokens. Similar to weak segmentation tasks in computer vision, we regarded the annotated explanations as binary labels and determined precision and recall over all these target tokens.
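For reference, the token-level scoring just described can be sketched as follows: relevance scores are averaged over the target tokens and compared against the binary explanation labels. The use of scikit-learn's `average_precision_score` here is our illustrative choice, not necessarily the tooling behind the reported numbers.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def score_explanation(relevance_per_target, label_mask):
    """relevance_per_target: list of arrays, one per target token, each holding a
    relevance score for every context token.
    label_mask: binary array marking the annotated explanation tokens."""
    scores = np.mean(np.stack(relevance_per_target), axis=0)   # average over target tokens
    return average_precision_score(label_mask, scores)
```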
Results.
The results are shown in Tab. 1. It can be observed that the proposed AtMan method thoroughly outperforms all previous approaches by means of mean average precision. This statement holds as well for the mean average interquartile recall. However, on the mean average recall Chefer slightly outperforms AtMan. Furthermore, it is noteworthy that the small average explanation length (such as depicted in Fig. 6) results in high values for recall scores in all methods. Further details and some qualitative examples can be found in Appendix A.2.
Paragraph Chunking.
AtMan can naturally be lifted to the explanation of paragraphs. We ran experiments in which AtMan splits the input text into a few paragraphs at common delimiters and evaluates each resulting chunk as a whole, instead of token-wise. This significantly decreases the total number of required forward passes and, in addition, produces more human-readable text explanations than the otherwise heterogeneously highlighted word parts. Results are shown in Appendix A.8.
4.2 AtMan can do Visual reasoning
Protocol.
Similar to language reasoning, we again perform XAI on generative models. We evaluated the OpenImages [15] dataset as a VQA task and generated open-vocabulary predictions with the autoregressive model. Specifically, the model is prompted with the template "{Image} This is a picture of ", and the explainability methods are executed to derive scores for the pixels of the image with respect to the target label. If there were multiple tokens in the target label, we took the average of the generated scores for each target token. For evaluation, we considered the segmentation annotations of the dataset as ground-truth explanations. The segmentation subset contains 2.7M annotated images for 350 different classes. In order to ensure a good performance of the large-scale model at hand and, in turn, adequate explanations of the XAI methods, we filtered the images for a minimum dimension of pixels and a maximal proportional deviation between width and height of . Moreover, we randomly sampled images per class to avoid overweighting classes. This filtering leads to a dataset of samples. The average context sequence length is tokens, and the average label coverage is of the input image.

Quantitative Results.
The results are shown in Tab. 2. It can be observed that AtMan thoroughly outperforms all other XAI approaches on the visual reasoning task for all metrics. Note how explicit transformer XAI methods (AtMan, Chefer) in particular outperform generic methods (GradCAM, IG, IxG) in recall. Moreover, while being memory-efficient (see next section), AtMan also generates more accurate explanations compared to Chefer. Through the memory efficiency of AtMan, we were able to evaluate an intermediate version of a 30B upscaling trial of MAGMA (c.f. Tab. 2). Interestingly, the general explanation performance slightly decreases compared to the 6B model variant. This could be attributed to the increased complexity of the model and, subsequently, the complexity of the explanation at hand. Hence, it is not expected that the “human” alignment with the model’s explanations scales with their size.
| | IxG | IG | GradCAM | Chefer | AtMan (6B) | AtMan (30B) |
|---|---|---|---|---|---|---|
| mAP | 38.0 | 46.1 | 56.7 | 49.9 | 65.5 | 61.2 |
| mAP (IQ) | 34.1 | 45.2 | 60.4 | 50.2 | 70.2 | 65.1 |
| mAR | 0.2 | 0.3 | 0.1 | 11.1 | 13.7 | 12.2 |
| mAR (IQ) | 0.1 | 0.1 | 0.1 | 10.1 | 19.7 | 14.5 |

Table 2: XAI performance on OpenImages (mean average precision and recall; IQ denotes the interquartile statistic). The two AtMan columns correspond to the MAGMA-6B model and the 30B upscaling trial.
Qualitative Illustration.
Fig. 21 shows several generated image explanations of AtMan and Chefer for different concepts. More examples for all methods can be found in Appendix A.7. We generally observe more noise in gradient-based methods, in particular around the edges. Note that since VQA only changes the target tokens, AtMan does not need to evaluate the prompt more than once for different object classes.
In general, the results clearly provide an affirmative answer to (Q1): AtMan is competitive with previous XAI methods, including transformer-specific ones. Next, we will analyze the computational efficiency of AtMan.
4.3 AtMan can do large scale
While AtMan shows competitive performance, it computes, unlike previous approaches, explanations at almost no extra memory cost. Fig. 8 illustrates the runtime and memory consumption on a single NVIDIA A100 80GB GPU. We evaluated the gradient-based transformer XAI method [6] and AtMan. The statistics vary in sequence lengths (colors) from 128 to 1024 tokens, and all experiments are executed with batch size 1 for better comparison.
One can observe that the memory consumption of AtMan is around that of a plain forward pass (Baseline; green) and increases only marginally with the sequence length. In comparison, the method of [6], like other gradient-based methods, exceeds the memory limit, consuming more than double the memory. Therefore, these methods fail on larger sequence lengths.
Whereas the memory consumption of AtMan stays almost constant, its execution time increases significantly with the sequence length when no further token aggregation is applied upfront. However, note that the exhaustive search loop of AtMan can be run in parallel to decrease its runtime. In particular, this can be achieved by increasing the batch size and, naturally, by pipeline-parallel execution (cf. https://pytorch.org/docs/stable/pipeline.html). For instance, since large models beyond 100B parameters are scattered among nodes and thus many GPUs, the effective runtime is reduced by orders of magnitude, approaching the scale of a single forward pass.
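One way to realize this parallelism is to batch several suppressed variants of the same prompt into a single forward pass, one suppressed token index per batch element. The hook `suppress_token_per_batch_element` and the helper `batched_target_cross_entropy` below are hypothetical interfaces used only to illustrate the idea.

```python
def atman_explain_batched(model, tokenizer, prompt, target, f=0.9, batch_size=16):
    """Evaluate several single-token suppressions per forward pass (sketch)."""
    base_loss = target_cross_entropy(model, tokenizer, prompt, target)
    n_prompt = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]

    relevances = []
    for start in range(0, n_prompt, batch_size):
        indices = list(range(start, min(start + batch_size, n_prompt)))
        # Hypothetical hook: batch element b suppresses token indices[b].
        with model.suppress_token_per_batch_element(indices, factor=f):
            losses = batched_target_cross_entropy(            # hypothetical batched variant
                model, tokenizer, [prompt] * len(indices), target)
        relevances.extend(loss - base_loss for loss in losses)
    return relevances
```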
Overall, these results clearly provide an affirmative answer to (Q2): Through the memory efficiency of AtMan, it can be applied to large-scale transformer-based models.

5 Conclusion
We proposed AtMan, a modality-agnostic perturbation-based XAI method for transformer networks. In particular, AtMan reduces the complex issue of finding proper perturbations to a single scaling factor per token. As our experiments demonstrate, AtMan outperforms current approaches relying on gradient computation. AtMan is memory-efficient and requires forward passes only, enabling its utilization for deployed large models.
However, some limitations remain. Whereas AtMan reduces the overall noise of the generated explanations compared to gradient-based methods, undesirable artifacts still remain. It is unclear to what extent this is due to the method or to the underlying transformer architecture. Through AtMan's memory efficiency, one is able to evaluate whether models' explanatory capabilities scale with their size. The extent to which larger models produce explanations that are more difficult to understand is a question that arises when comparing performance scores. Consequently, scaling explainability with model size should be studied further. Besides this, our paper provides several avenues for future work, including explanatory studies of current generative models impacting our society. Furthermore, it could lay the foundation for not only instructing and, in turn, improving the predictive outcome of autoregressive models based on human feedback [20] but also their explanations [11].
Acknowledgments
This research has benefited from the Hessian Ministry of Higher Education, Research, Science and the Arts (HMWK) cluster projects “The Third Wave of AI” and hessian.AI as well as from the German Center for Artificial Intelligence (DFKI) project “SAINT”. Further, we thank Manuel Brack, Felix Friedrich, Marco Bellagente and Constantin Eichenberg for their valuable feedback.
References
- [1] (2020) Quantifying attention flow in transformers. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pp. 4190–4197.
- [2] (2021) Too much in common: shifting of embeddings in transformer language models and its implications. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pp. 5117–5130.
- [3] (2021) A survey on the explainability of supervised machine learning. Journal of Artificial Intelligence Research 70, pp. 245–317.
- [4] (2020) End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision (ECCV), Lecture Notes in Computer Science, pp. 213–229.
- [5] (2021) Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 397–406.
- [6] (2021) Transformer interpretability beyond attention visualization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 782–791.
- [7] (2020) A survey of the state of explainable AI for natural language processing. In Proceedings of the Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing (AACL/IJCNLP), pp. 447–459.
- [8] (2021) An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations (ICLR).
- [9] (2022) MAGMA – multimodal augmentation of generative models through adapter-based finetuning. In Findings of EMNLP.
- [10] (2021) A mathematical framework for transformer circuits. Transformer Circuits Thread. https://transformer-circuits.pub/2021/framework/index.html
- [11] (2023) A typology to explore and guide explanatory interactive machine learning. Nature Machine Intelligence.
- [12] (2020) Roles and utilization of attention heads in transformer-based neural language models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pp. 3404–3417.
- [13] (2017) Understanding black-box predictions via influence functions. In Proceedings of the International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, Vol. 70, pp. 1885–1894.
- [14] (2020) Captum: a unified and generic model interpretability library for PyTorch. arXiv:2009.07896.
- [15] (2017) OpenImages: a public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github.com/openimages.
- [16] (2010) Introduction to information retrieval. Journal of the Association for Information Science and Technology 61, pp. 852–853.
- [17] (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the International Conference on Machine Learning (ICML).
- [18] (2017) A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems (NeurIPS), pp. 4765–4774.
- [19] (2019) Layer-wise relevance propagation: an overview. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, Lecture Notes in Computer Science, Vol. 11700, pp. 193–209.
- [20] (2022) Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems (NeurIPS).
- [21] (2021) Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, pp. 8748–8763.
- [22] (2016) SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2383–2392.
- [23] (2016) "Why should I trust you?": explaining the predictions of any classifier. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 1135–1144.
- [24] (2020) Grad-CAM: visual explanations from deep networks via gradient-based localization. International Journal of Computer Vision 128.
- [25] (2017) Learning important features through propagating activation differences. In Proceedings of the International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, Vol. 70, pp. 3145–3153.
- [26] (2017) Axiomatic attribution for deep networks. In Proceedings of the International Conference on Machine Learning (ICML), pp. 3319–3328.
- [27] (2017) Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), pp. 5998–6008.
- [28] (2022) GIT: a generative image-to-text transformer for vision and language. Transactions on Machine Learning Research.
- [29] (2022) Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In Proceedings of the International Conference on Machine Learning (ICML).
Appendix A Appendix
A.1 Remarks on executed benchmarks
We executed all benchmarks faithfully and to the best of our knowledge. The selection of compared methods was made to be rather diverse and to give a good overview of this field of research, in particular with regard to the multi-modal transformer scaling behavior, as there are in fact no such studies for AR models yet to compare to. It is possible, for all methods, that there are still improvements we missed in quality as well as performance. However, we see the optimization of other methods for multi-modal AR transformer models as a research direction of its own.
Chefer.
The integration of Chefer was straightforward. As can be seen from the visualizations, there are noticeable artifacts, particularly at the edges of images. In this work, the underlying transformer model was MAGMA, which is finetuned using sequential adapters. It is possible that this, or the multi-modal AR nature itself, is the cause of these artifacts. We did not further investigate to what extent the adapters would need to be explicitly integrated into the attribution accumulation of Chefer. Also note that AtMan often shows similar, though less severe, artifacts.
IxG, IG and GradCAM.
The methods IxG, IG, and (guided) GradCAM failed completely from a quality perspective. They were the only ones that operated on the pixel level and thus also included the vision encoder in the backward pass (which is even a requirement for GradCAM; it can only be used to explain images). We did not further investigate or fine-tune the evaluations for any method. All methods are evaluated with the same metrics and therewith give a reasonable performance comparison without additional customization or configuration.
Details on Results.
For a fair comparison, all experiments were executed on a single GPU, as scaling out would naturally benefit all methods. We also want to highlight that we did not further optimize the methods for performance but rather adopted the repositories as they were. The memory inefficiency of gradient-based methods arises from the backward pass. The most memory-efficient representative is the single-layer attribution method IxG, which only computes partial derivatives of the loss with respect to the input. Even this approach increases the memory requirement substantially and fails for the scaling experiments up to 34B.
In Fig. 8 we ran Chefer with a full backward pass. We additionally reduced this to the minimum amount of gradients possible (that we saw) and plot the full scaling benchmark below in Fig. 9 (setting requires_grad=False on every tensor but the attention tensors).
The methods IxG, IG, and GradCAM are integrated using the Captum library [14]. We expect them to be implemented as performantly as possible. Integrated Gradients is a perturbation method on the input that integrates changes over the gradients; the implementation at hand frequently runs out of memory. Finally, GradCAM is a method specialized for CNNs and therefore does not work for text only (or varying sequence lengths). It requires the least amount of resources but also produces poor results without further investigation.
AtMan Parallelizability.
As a final remark on AtMan, we want to recall again that the runtime measured in the sequential execution can be drastically reduced due to its parallelizability, in particular, as it only requires forward passes. For sequence length 1024, we measured 1024 iterations in order to explain each token. However, note that AtMan can also be applied to only parts or chunks of the sequence (cf. Sec. A.8), in contrast to gradient methods. Moreover, all tokens to explain can be computed entirely in parallel. In a cluster deployment, these can be distributed amongst all available workers. On top, the workload can be divided by the available batch size and by true pipeline parallelism.
A.2 Detailed SQuAD Evaluations
This section gives more detailed statistics on the scores presented in Tab. 1. Fig. 10 shows the histogram of the token lengths of all explanations. Fig. 11 shows the mAP score for all methods on the entire dataset, grouped by the number of questions occurring per instance.


A.3 Detailed OpenImages Evaluations
This section gives more detailed statistics on the scores presented in Tab. 2. Fig. 12 is the histogram of the fraction of label coverage on all images. Fig. 13 and 14 are boxplots for all methods on the entire dataset, for mean average precision as well as recall.



A.4 Discussion of Cosine Embedding Similarity
We fixed the parameter $\kappa$ of Eq. 6 empirically by running a line sweep once on a randomly sampled subset of the OpenImages dataset and use the resulting value throughout this work. In Fig. 15 and 16 we compare the mean average precision and recall scores on OpenImages for both variants, with and without correlated token suppression. Clearly, correlated token suppression outperforms single-token suppression.


Fig. 17 visually shows the effect on weak image segmentation when correlated token suppression is activated compared to single-token suppression only. Notice how single-token suppression only occasionally hits the label and often marks a token at an edge. This gives us reason to believe that entropy accumulates around such edges during layer-wise processing. This effect (on these images) completely vanishes with correlated token suppression.

A.5 Variation Discussion of the Method
Note that the results of Eq. 4 are directly passed to a softmax operation. The softmax of a vector $\mathbf{z}$ is defined as $\mathrm{softmax}(\mathbf{z})_i = e^{z_i} / \sum_j e^{z_j}$. In particular, the entries $0$ and $-\infty$ yield the unnormalized results $e^0 = 1$ and $e^{-\infty} = 0$, respectively. So one might argue as follows: if we intend to suppress the entropy of token $i$, we should not multiply its score by a factor, but rather subtract from it. I.e., we propose the modification

$$\tilde{H}^{l,h} \;=\; H^{l,h} - F_i, \qquad \forall\, l, h. \tag{7}$$
The only problem with Eq. 7 is that it skews the cosine-neighborhood factors. While we experienced this variant working more naturally in principle for hand-crafted factors, we did not get the best performance in combination with Eq. 6. In the following Fig. 18 and 19, we show evaluations analogous to Fig. 15 and 16. It is particularly interesting that the mode without correlated tokens slightly improves, while the one with correlated tokens slightly decreases in scores, for both metrics.
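For completeness, the subtractive variant can be sketched analogously to the multiplicative one from Sec. 3.2. Both the reconstruction of Eq. 7 above and this sketch reflect our reading of the description, not the exact released variant; the default subtraction magnitude is a placeholder.

```python
import torch

def suppress_token_scores_subtractive(scores: torch.Tensor, token_idx: int,
                                      f: float = 0.9) -> torch.Tensor:
    """Variant of Eq. 7 (sketch): subtract the suppression strength from the token's
    score column instead of scaling it, pushing its post-softmax weight towards zero."""
    scores = scores.clone()
    scores[:, :, token_idx] -= f
    return scores
```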

A.6 Artifacts and failure modes
In Fig. 20 we re-do an experiment (with the same examples) of [5]. That is, given a VQA text-image prompt that is supposed to be answered with "yes" or "no", we derive the explanation on both the input question and the image at the same time. The results are very blurry; in particular, we noticed exceptionally high noise around the edges of the images for both methods. The methods sometimes seem to highlight the correct area, in particular in the giraffe and frisbee samples, however along with a lot of noise. It is interesting that the methods highlight different areas, in the questions as well as in the images. In general, we observe inconsistent behavior of the model's completions for this kind of prompt. It is therefore questionable what the explainability methods produce here at all. They might, however, indicate a direction on how to specifically address the shortcomings of the model itself.

A.7 Qualitative comparison of weak image segmentation
In the following Fig. 21 we give several examples for a better comparison between the methods on the task of weak image segmentation. To generate the explanations, we prompt the model with "{Image} This is a picture of " and extract the scores towards the next target tokens as described by Eq. 5 for AtMan. For multiple target tokens, these results are averaged. In the same fashion, but with an additional backpropagation towards the next target token, we derive the explanations for Chefer and the other gradient methods.

A.8 Application to document Q/A
In Fig. 22 we apply AtMan paragraph-wise on a larger context of around 500 tokens. The context is first split into chunks at the delimiter tokens ".", ",", "\n", and " and". Each chunk is then evaluated iteratively by prompting in the fashion "{Context} Q: {Question} A:" and extracting the cross-entropy towards the target tokens while suppressing the entire chunk at once, as described in Sec. 3. A sketch of this procedure is given below. It can be observed that the correct paragraphs are highlighted for the given questions and expected targets. In particular, one can observe the model's interpretation, like the mapping of formats or of states to countries. Note in particular that it is not fooled by questions not answered by the text (last row).
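The chunk-wise procedure can be sketched as follows, reusing `target_cross_entropy` from Sec. 3 and a hypothetical `model.suppress_tokens` hook that applies Eq. 4 to a whole set of token indices at once. The chunk-to-token alignment is approximate, since tokenizer merges at chunk boundaries are ignored.

```python
import re

def atman_explain_chunks(model, tokenizer, context, question, target, f=0.9):
    """Relevance per chunk: suppress all tokens of one chunk at once and
    measure the change of the target cross-entropy."""
    chunks = [c for c in re.split(r"(?<=[.,\n])|(?= and )", context) if c.strip()]
    prompt = f"{context} Q: {question} A:"
    base_loss = target_cross_entropy(model, tokenizer, prompt, target)

    relevances, offset = [], 0
    for chunk in chunks:
        n_chunk = tokenizer(chunk, return_tensors="pt").input_ids.shape[1]
        token_indices = list(range(offset, offset + n_chunk))   # approximate alignment
        with model.suppress_tokens(token_indices, factor=f):    # hypothetical multi-token hook
            loss = target_cross_entropy(model, tokenizer, prompt, target)
        relevances.append((chunk, loss - base_loss))
        offset += n_chunk
    return relevances
```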
