A Geometric Notion of Causal Probing

by   Clément Guerner, et al.

Large language models rely on real-valued representations of text to make their predictions. These representations contain information learned from the data the model was trained on, including knowledge of linguistic properties and forms of demographic bias, e.g., based on gender. A growing body of work has considered removing information about such concepts using orthogonal projections onto subspaces of the representation space. We contribute to this body of work by proposing a formal definition of the intrinsic information in a subspace of a language model's representation space. We propose a counterfactual approach that avoids the failure mode of spurious correlations (Kumar et al., 2022) by treating the components in the subspace and in its orthogonal complement independently. We show that our counterfactual notion of information in a subspace is optimized by a causal concept subspace. Furthermore, this intervention allows us to attempt concept-controlled generation by manipulating the value of the conceptual component of a representation. Empirically, we find that R-LACE (Ravfogel et al., 2022) returns a one-dimensional subspace containing roughly half of the total concept information under our framework. Our causal intervention shows that, for at least one model, the subspace returned by R-LACE can be used to manipulate the concept value of the generated word with precision.
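The abstract describes two operations on a representation: erasing the component lying in a concept subspace via an orthogonal projection, and a counterfactual intervention that keeps the orthogonal-complement component while swapping in a different concept-subspace component. A minimal sketch of both operations, assuming a synthetic one-dimensional concept subspace spanned by a unit vector `v` (the vectors below are illustrative stand-ins, not the subspace R-LACE actually finds):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# Unit vector spanning a hypothetical 1-D concept subspace.
v = rng.normal(size=d)
v /= np.linalg.norm(v)
P = np.outer(v, v)              # orthogonal projector onto span(v)

h = rng.normal(size=d)          # a representation to edit
h_source = rng.normal(size=d)   # representation carrying the target concept value

# Erasure: remove the component inside the concept subspace.
h_erased = h - P @ h

# Counterfactual intervention: keep h's orthogonal-complement component,
# but take the concept-subspace component from h_source.
h_counterfactual = (h - P @ h) + P @ h_source

# The erased vector has no component along v ...
assert abs(h_erased @ v) < 1e-10
# ... and the counterfactual inherits its concept component from h_source,
# while its complement component matches the erased vector.
assert np.isclose(h_counterfactual @ v, h_source @ v)
assert np.allclose(h_counterfactual - P @ h_counterfactual, h_erased)
```

Treating the two components independently in this way is what lets the counterfactual definition sidestep spurious correlations between the subspace and its complement.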


