Linear Representations and Superposition

Published: February 14, 2026 at 11:29 PM EST
6 min read

Source: Hacker News

As LLMs become larger, more capable, and more ubiquitous, the field of mechanistic interpretability—that is, understanding the inner workings of these models—becomes increasingly interesting and important.

Similar to how software engineers benefit from having good mental models of file systems and networking, AI researchers and engineers should strive to have a theoretical basis for understanding the “intelligence” that emerges from LLMs. A strong mental model would improve our ability to harness the technology.

In this post I want to cover two fundamental and related concepts (each with its own paper) that I find fascinating from a mathematical perspective:

  • The linear representation hypothesis (Park et al.)
  • Superposition (Anthropic's toy-models paper)

Linear representation hypothesis (LRH)

The LRH has existed for quite some time, ever since people noticed that the word embeddings produced by Word2Vec satisfied some interesting properties.
If we let $E(x)$ be the embedding vector of a word, then we observe the approximate equivalence

$$E(\text{``king''}) - E(\text{``man''}) + E(\text{``woman''}) \;\approx\; E(\text{``queen''}).$$

Observations of this form suggest that concepts (e.g., gender in the example) are represented linearly in the geometry of the embedding space—a simple but non‑obvious claim.
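This analogy arithmetic is easy to reproduce numerically. The sketch below uses hand-picked toy vectors rather than trained Word2Vec embeddings, purely to illustrate the "subtract one concept, add another, find the nearest neighbor" recipe:

```python
import numpy as np

# Toy stand-in embeddings (made up for illustration; real Word2Vec
# vectors would come from a trained model, e.g. via gensim).
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.8, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.1, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman should land near queen.
target = emb["king"] - emb["man"] + emb["woman"]

# Nearest neighbor by cosine similarity, excluding the query words.
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, emb[w]))
print(best)  # queen
```

With real embeddings the match is only approximate, which is why the hypothesis is stated as an approximate equivalence rather than an equality.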

Figure: Simplified model of an LLM in terms of embeddings and unembeddings.

Fast‑forward to modern LLMs, and the LRH remains a popular way to interpret what is happening inside these models. The Park et al. paper presents a mathematical framing of the hypothesis that treats most of the inner workings (MLP, attention, etc.) as a black box and focuses on two separate representation spaces, each with the same dimensionality as the model's hidden states:

  • Embedding space – the final hidden states of the network live here ($E(x)$ for an input context $x$). This is analogous to the word‑embedding formulation and is where you would perform interventions that affect the model's behavior (see the "monosemanticity" scaling paper).
  • Unembedding space – the rows of the unembedding matrix live here ($U(y)$ for each output token $y$). A linear probe over the hidden state (to evaluate the presence of a concept) corresponds to a vector in this space.

There are analogous LRH formulations in the two spaces. Let $C$ denote the directional concept of gender (male → female). Then any pair of input contexts that differ only in that concept should satisfy, e.g.,

$$E(\text{``Long live the queen''}) - E(\text{``Long live the king''}) \;=\; \alpha\, E_C, \qquad \alpha \ge 0,$$

where $E_C$ is a constant vector in the embedding space (the embedding representation). Similarly, any pair of output tokens that differ only in that concept should satisfy

$$U(\text{``queen''}) - U(\text{``king''}) \;=\; \beta\, U_C, \qquad \beta \ge 0,$$

where $U_C$ is the unembedding representation. In other words, applying the concept has a linear effect in both spaces.
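A tiny numpy sketch of these two views, with random stand-in vectors rather than real model activations: a linear probe reads a concept off the hidden state as a dot product, and an intervention steers the hidden state along the concept direction. For simplicity the sketch uses one unit vector for both roles, since under the paper's framing the two representations correspond:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
E_ctx = rng.normal(size=d)                       # stand-in hidden state
C = rng.normal(size=d)
C /= np.linalg.norm(C)                           # stand-in concept direction (unit norm)

# Linear probe (unembedding side): concept presence as a dot product.
def probe(e):
    return C @ e

# Intervention (embedding side): steer the state along the concept direction.
alpha = 3.0
E_steered = E_ctx + alpha * C

# Because C is unit-norm, the probe score moves by exactly alpha.
print(probe(E_steered) - probe(E_ctx))
```

The point of the sketch is only the geometry: a linear concept representation makes both the read-out (probe) and the write (intervention) one-dimensional operations.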

The paper shows that these two representations are isomorphic, which unifies the intervention and linear‑probe ideas. Empirically, they verify on Llama 2 that they can find embedding and unembedding representations for a variety of concepts (e.g., present → past tense, noun → plural, English → French) that approximately fit their theoretical framework.

Figure: Approximate orthogonality of concept representations in Llama 2. Source: Park et al.


Superposition

Assuming concepts truly have linear representations, it would be natural to expect unrelated concepts to be orthogonal. Otherwise, applying the male → female direction could unintentionally affect the English → French direction, which does not make sense.

One of the key results from Park et al. is that this orthogonality does not hold under the standard Euclidean inner product. Instead, it emerges under a “causal inner product” derived from the unembedding matrix. Only when we view concept representations through that lens do we obtain the orthogonality we expect.
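Roughly speaking, Park et al. construct this inner product from the covariance of the unembedding rows. The sketch below assumes that form (a random stand-in unembedding matrix, and $\langle u, v\rangle_C = u^\top \Sigma^{-1} v$ with $\Sigma$ the unembedding covariance); the paper's actual construction involves more care:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 1000, 64
U = rng.normal(size=(vocab, d))          # stand-in unembedding matrix (vocab x d)

Sigma = np.cov(U, rowvar=False)          # covariance of the unembedding rows (d x d)
A = np.linalg.inv(Sigma)                 # matrix defining the causal inner product

def causal_inner(u, v):
    # <u, v>_C = u^T Sigma^{-1} v  (assumed form, see lead-in)
    return u @ A @ v
```

Two concept directions that look correlated under the Euclidean inner product can come out orthogonal under `causal_inner`; that change of geometry is the paper's point.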

However, the representation space of modern LLMs is relatively small (typically 2K–16K dimensions). How can such a low‑dimensional space accommodate the huge number of language features that far exceed its dimensionality? It is impossible for all features to be mutually orthogonal, regardless of the geometry.

Figure: The interference effect of non‑orthogonal features. Source: Anthropic.

This is where superposition comes into play. When you have $N$ vectors in a $d$-dimensional space with $N > d$, the vectors inevitably interfere: their inner products acquire non‑trivial magnitudes. Superposition provides a framework for understanding how many more concepts than dimensions can be packed into a model by allowing them to share (i.e., be superimposed onto) the same subspace while remaining approximately disentangled under the appropriate causal inner product.


Low‑Dimensional Intuition vs. Higher Dimensions

The failure of low‑dimensional intuition to extend to higher dimensions is illustrated by the Johnson–Lindenstrauss lemma. One implication of the lemma is that you can choose exponentially many vectors (in the number of dimensions) that are almost orthogonal, i.e., the inner product between any pair is bounded by a small constant. This can be viewed as the flip side of the curse of dimensionality.
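This near-orthogonality is easy to observe empirically: random unit vectors in a high-dimensional space have pairwise cosines that concentrate tightly around zero. A quick numpy check (dimensions and counts chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 1024, 10_000                      # far more vectors than dimensions
V = rng.normal(size=(n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # project onto the unit sphere

# Pairwise cosines of a random subsample; for random directions these
# concentrate around 0 with standard deviation ~ 1/sqrt(d).
idx = rng.choice(n, size=200, replace=False)
G = V[idx] @ V[idx].T                    # Gram matrix of cosines
off_diag = np.abs(G[~np.eye(len(idx), dtype=bool)])
print(off_diag.max())                    # small, despite n >> d
```

With $d = 1024$ the typical cosine magnitude is around $1/\sqrt{d} \approx 0.03$, so even the worst pair among tens of thousands stays far from 1.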


Superposition in Toy Models

The Anthropic paper demonstrates the superposition phenomenon on small, synthetic datasets. One particularly interesting observation is that superposition does not occur in a purely linear model (no activation function), but it does emerge with a nonlinearity (ReLU in their case). The non‑linearity allows the model to manage interference in a productive way. This works only because of the natural sparsity of the features in the data: models learn to superimpose features that are unlikely to be simultaneously present.
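The toy setup can be sketched in a few lines. Shapes, initialization, and the absence of a training loop are all simplifications here; the paper trains this architecture to reconstruct sparse feature vectors and then inspects the learned columns of $W$:

```python
import numpy as np

rng = np.random.default_rng(0)
n_feat, d_hidden = 5, 2                  # more features than hidden dimensions

# Tied-weights autoencoder as in the toy-models setup: compress with W,
# reconstruct with W^T, bias, and a ReLU.
W = rng.normal(size=(d_hidden, n_feat)) * 0.5
b = np.zeros(n_feat)

def forward(x):
    h = W @ x                            # squeeze 5 features into 2 dimensions
    return np.maximum(0.0, W.T @ h + b)  # ReLU lets the model clip interference

# Sparse input: features rarely co-occur, which is what makes
# superposition viable in the first place.
x = np.zeros(n_feat)
x[3] = 1.0
x_hat = forward(x)
loss = np.sum((x - x_hat) ** 2)          # reconstruction error to be minimized
```

After training, the columns of $W$ (one per feature) are what arrange themselves into the regular geometric structures discussed below.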

Figure: Square antiprism, the energy‑minimizing arrangement of eight points on a 3‑D unit sphere.


Regular Structures in Embedding Space

In experimental setups where the synthetic features have equal importance and sparsity, the authors observe that the embedding vectors learned by the model form regular structures in the embedding space, such as the square antiprism pictured above.

Coincidentally, these are the same types of structures encountered in older research on spherical codes. Those structures emerged from gradient‑descent‑like algorithms that minimized the energy (analogous to the Thomson problem) of point arrangements on unit hyperspheres. It’s fun to see the overlap of multiple fields!


Takeaway

Viewing features as linear representations—even if not the complete story (see this paper)—provides a valuable framework for interpreting and intervening in LLMs. The framework has a solid theoretical basis that is backed up empirically. Sparsity, superposition, and the non‑intuitive nature of higher‑dimensional spaces give us a window into understanding how the complexity of language (and perhaps intelligence?) gets captured by these models.
