How to Dissect a Muppet: The Structure of Transformer Embedding Spaces

by Timothee Mickus et al.

Pretrained embeddings based on the Transformer architecture have taken the NLP community by storm. We show that they can be mathematically reframed as a sum of vector factors and showcase how to use this reframing to study the impact of each component. We provide evidence that multi-head attention and feed-forward sublayers are not equally useful across downstream applications, as well as a quantitative overview of the effects of fine-tuning on the overall embedding space. This approach allows us to draw connections to a wide range of previous studies, from vector space anisotropy to attention weights.
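A minimal numeric sketch of the intuition behind the sum-of-vector-factors reframing (this is not the paper's code, and variable names are illustrative): because each Transformer sublayer adds its output to the residual stream, the final embedding can be rewritten as the input embedding plus the accumulated attention and feed-forward contributions. LayerNorm, which the paper's full derivation handles explicitly, is omitted here for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                      # toy embedding dimension
x = rng.normal(size=d)     # input embedding for one token

def make_sublayer(seed):
    # stand-in for an attention or feed-forward sublayer: any function of the stream
    w = np.random.default_rng(seed).normal(size=(d, d)) * 0.1
    return lambda v: w @ v

layers = [(make_sublayer(i), make_sublayer(100 + i)) for i in range(3)]

# Forward pass, tracking what each sublayer adds to the residual stream.
e = x.copy()
attn_total = np.zeros(d)   # sum of all multi-head attention contributions
ff_total = np.zeros(d)     # sum of all feed-forward contributions
for attn, ff in layers:
    a = attn(e); e = e + a; attn_total += a
    f = ff(e);   e = e + f; ff_total += f

# The output embedding decomposes exactly into a sum of vector factors.
assert np.allclose(e, x + attn_total + ff_total)
```

Once the embedding is decomposed this way, the contribution of each component (input, attention, feed-forward) can be ablated or measured independently, which is the basis for the component-level analyses described above.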




