Analyzing And Editing Inner Mechanisms Of Backdoored Language Models

02/24/2023
by   Max Lamparth, et al.
0

Recent advancements in interpretability research made transformer language models more transparent. This progress led to a better understanding of their inner workings for toy and naturally occurring models. However, how these models internally process sentiment changes has yet to be sufficiently answered. In this work, we introduce a new interpretability tool called PCP ablation, where we replace modules with low-rank matrices based on the principal components of their activations, reducing model parameters and their behavior to essentials. We demonstrate PCP ablations on MLP and attention layers in backdoored toy, backdoored large, and naturally occurring models. We determine MLPs as most important for the backdoor mechanism and use this knowledge to remove, insert, and modify backdoor mechanisms with engineered replacements via PCP ablation.

READ FULL TEXT
research
02/01/2023

Feed-Forward Blocks Control Contextualization in Masked Language Models

Understanding the inner workings of neural network models is a crucial s...
research
04/26/2022

LM-Debugger: An Interactive Tool for Inspection and Intervention in Transformer-Based Language Models

The opaque nature and unexplained behavior of transformer-based language...
research
11/28/2020

Understanding How BERT Learns to Identify Edits

Pre-trained transformer language models such as BERT are ubiquitous in N...
research
05/15/2023

Interpretability at Scale: Identifying Causal Mechanisms in Alpaca

Obtaining human-interpretable explanations of large, general-purpose lan...
research
05/24/2023

Understanding Arithmetic Reasoning in Language Models using Causal Mediation Analysis

Mathematical reasoning in large language models (LLMs) has garnered atte...
research
01/12/2023

Progress measures for grokking via mechanistic interpretability

Neural networks often exhibit emergent behavior, where qualitatively new...
research
04/28/2023

Towards Automated Circuit Discovery for Mechanistic Interpretability

Recent work in mechanistic interpretability has reverse-engineered nontr...

Please sign up or login with your details

Forgot password? Click here to reset