[Paper] Phase Transitions in Attention: A Bayesian Theory of Copy Head Emergence

Published: June 10, 2026 at 09:26 AM EDT

Source: arXiv - 2606.12058v1

Overview

Attention is the key mechanism underlying in-context learning in transformers, and attention patterns have been observed empirically to emerge abruptly during training. We present a Bayesian theory of feature learning in attention; we then focus on how the copy subcircuit in the first layer of an induction head is learned by analyzing a single-layer softmax attention network trained on a copy task. We derive a closed-form posterior over the attention matrix and reduce it to a low-dimensional order parameter space. This reduction reveals a phase transition in the amount of training data, which we verify using both Bayesian sampling and standard training with Adam. We contrast our results with linear attention and find that softmax attention exhibits a first-order phase transition, while in linear attention an initial second-order phase transition is followed by a smooth, continuous evolution toward the structured attention pattern (crossover). Our work provides a first-principles theoretical account of the abrupt emergence of the copy subcircuit, reminiscent of the one observed in training large language models.
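
To make the softmax versus linear attention contrast concrete, here is a minimal NumPy sketch of a single attention layer applied to one-hot tokens. It is a toy illustration under stated assumptions, not the paper's model or training setup: the combined query-key matrix `A = beta * I`, the scaling choices, and the helper `diagonal_mass` are all hypothetical, and `diagonal_mass` is only an illustrative scalar summary, not the order parameter derived in the paper.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def attention(X, A, kind="softmax"):
    """Single attention layer with a combined query-key matrix A.

    X: (T, d) token embeddings, A: (d, d) attention matrix.
    Returns (output, weights). The 1/sqrt(d) and 1/T scalings are
    assumptions made for this sketch.
    """
    T, d = X.shape
    scores = X @ A @ X.T / np.sqrt(d)
    if kind == "softmax":
        weights = softmax(scores, axis=-1)
    elif kind == "linear":
        weights = scores / T
    else:
        raise ValueError(f"unknown attention kind: {kind}")
    return weights @ X, weights


def diagonal_mass(weights):
    """Hypothetical scalar summary: mean attention weight on the diagonal."""
    return float(np.mean(np.diag(weights)))


# Four orthogonal one-hot tokens; a structured A = beta * I makes each
# token attend to itself, so the layer copies its input as beta grows.
X = np.eye(4)
for beta in (0.0, 2.0, 8.0, 20.0):
    A = beta * np.eye(4)
    _, w_soft = attention(X, A, "softmax")
    _, w_lin = attention(X, A, "linear")
    print(f"beta={beta:5.1f}  softmax={diagonal_mass(w_soft):.4f}  "
          f"linear={diagonal_mass(w_lin):.4f}")
```

In this toy, the softmax weights start uniform at `beta = 0` and saturate toward a sharp copy pattern as `beta` grows, while the linear weights simply scale with `beta`. The sketch only shows what a structured attention pattern looks like in each case; it says nothing about how such a pattern emerges as the amount of training data changes, which is the question the paper addresses.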

Key Contributions

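The abstract highlights the following contributions:

A Bayesian theory of feature learning in attention, applied to the copy subcircuit in the first layer of an induction head.
A closed-form posterior over the attention matrix of a single-layer softmax attention network trained on a copy task, reduced to a low-dimensional order parameter space.
A phase transition in the amount of training data, verified with both Bayesian sampling and standard training with Adam.
A contrast with linear attention: softmax attention shows a first-order phase transition, whereas linear attention shows an initial second-order phase transition followed by a smooth crossover toward the structured attention pattern.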

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

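According to the abstract, the work gives a first-principles theoretical account of the abrupt emergence of the copy subcircuit, reminiscent of the one observed in training large language models.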

Authors

Itay Lavie
Kirsten Fischer
Andrey Lekov
Frederic Van Maele
Zohar Ringel
Moritz Helias

Paper Information

arXiv ID: 2606.12058v1
Categories: stat.ML, cond-mat.dis-nn, cs.LG
Published: June 10, 2026

