Interpreting Neural Networks through the Polytope Lens

by Sid Black et al.

Mechanistic interpretability aims to explain what a neural network has learned at a nuts-and-bolts level. What are the fundamental primitives of neural network representations? Previous mechanistic descriptions have used individual neurons or their linear combinations to understand the representations a network has learned. But there are clues that neurons and their linear combinations are not the correct fundamental units of description: directions cannot describe how neural networks use nonlinearities to structure their representations. Moreover, many instances of individual neurons and their combinations are polysemantic (i.e. they have multiple unrelated meanings). Polysemanticity makes interpreting the network in terms of neurons or directions challenging, since we can no longer assign a specific feature to a neural unit. In order to find a basic unit of description that does not suffer from these problems, we zoom in beyond directions to study the way that piecewise linear activation functions (such as ReLU) partition the activation space into numerous discrete polytopes. We call this perspective the polytope lens. The polytope lens makes concrete predictions about the behavior of neural networks, which we evaluate through experiments on both convolutional image classifiers and language models. Specifically, we show that polytopes can be used to identify monosemantic regions of activation space (while directions are not, in general, monosemantic) and that the density of polytope boundaries reflects semantic boundaries. We also outline a vision for what mechanistic interpretability might look like through the polytope lens.
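The core objects in the abstract can be made concrete with a small sketch. In a ReLU network, every input induces an on/off pattern across all units; inputs sharing a pattern lie in the same polytope, within which the network computes a single affine function. The toy network, random weights, and helper names below are illustrative assumptions, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny 2-layer ReLU network with random weights (illustrative only).
W1, b1 = rng.normal(size=(5, 2)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 5)), rng.normal(size=5)

def activation_pattern(x):
    """Return the on/off pattern of every ReLU unit for input x.

    All inputs sharing the same pattern lie in the same polytope,
    where the network acts as one affine map.
    """
    h1 = W1 @ x + b1
    h2 = W2 @ np.maximum(h1, 0) + b2
    return tuple((h1 > 0).astype(int)) + tuple((h2 > 0).astype(int))

def boundary_crossings(x_start, x_end, steps=200):
    """Estimate how many polytope boundaries a straight path crosses,
    by counting activation-pattern changes along the interpolation.
    """
    path = np.linspace(0.0, 1.0, steps)
    patterns = [activation_pattern(x_start + t * (x_end - x_start)) for t in path]
    return sum(p != q for p, q in zip(patterns, patterns[1:]))
```

A high `boundary_crossings` count between two inputs is the quantity the paper relates to semantic boundaries: paths between semantically different inputs tend to cross denser regions of polytope boundaries.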



