[Paper] Circuit Synchronization Precedes Generalization: Causal Evidence from Fourier Structure in Grokking Transformers
Source: arXiv - 2606.12966v1
Overview
Grokking, in which a transformer trained on modular arithmetic abruptly jumps from near-chance to near-perfect validation accuracy, has been attributed to a Fourier circuit, but its timing, causal structure, and controllability remain poorly understood. We introduce the Frequency Synchronization Degree (FSD), a normalised, permutation-tested metric of Fourier circuit synchronisation that requires no prior circuit knowledge. Across nine modular addition configurations (primes p in {53, 71, 97, 113, 131}, three seeds), FSD synchronises 500-3,000 steps before grokking (mean lead +1,722 steps; all nine leads positive, sign-test p ≈ 0.004). It also precedes a restricted-logit loss baseline (Nanda et al.’s excluded loss) in all nine cases, making it the earliest available predictor.
We provide direct causal evidence that the inter-phase gap is a regularisation phenomenon: when training is forked at the FSD-ceiling step and the weight decay lambda is varied, grokking arrives strictly monotonically earlier as lambda increases, with the delay Delta_t proportional to 1/lambda. This law replicates across three primes (p in {53, 97, 131}; R^2 = 1.00 and R^2 = 0.99 for the two clean cases) and is captured as Delta_t ~ C/lambda, consistent with (1/lambda)*log(||W_mem||/tau).
Architecture ablations show that an attention-only model groks with a strong FSD precursor, an MLP-only model never groks, and a single-layer model’s FSD lags, confirming that the precursor is a multi-block circuit property.
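The form (1/lambda)*log(||W_mem||/tau) is stated without derivation in this summary. One simple way such a form can arise is sketched below. It rests on assumptions that are not confirmed by the source: that ||W_mem|| denotes the norm of the memorising weights at the fork step, that tau is a norm threshold below which the model generalises, that weight decay is the only force acting on those weights after the fork, and that time is measured in units that absorb the learning rate.

```latex
% Assumption: after the fork, weight decay alone shrinks the memorising weights.
\frac{d}{dt}\,\lVert W_{\mathrm{mem}}(t)\rVert = -\lambda\,\lVert W_{\mathrm{mem}}(t)\rVert
\quad\Longrightarrow\quad
\lVert W_{\mathrm{mem}}(t)\rVert = \lVert W_{\mathrm{mem}}(0)\rVert\, e^{-\lambda t}

% Assumption: grokking occurs when the norm first reaches the threshold tau.
\lVert W_{\mathrm{mem}}(\Delta_t)\rVert = \tau
\quad\Longrightarrow\quad
\Delta_t = \frac{1}{\lambda}\,\log\frac{\lVert W_{\mathrm{mem}}(0)\rVert}{\tau} = \frac{C}{\lambda},
\qquad C = \log\frac{\lVert W_{\mathrm{mem}}(0)\rVert}{\tau}
```

Under these assumptions C depends only on the state at the fork, which would explain why the delay scales as 1/lambda when lambda alone is varied.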
Key Contributions
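This paper makes the following contributions:
- Introduces FSD, a normalised, permutation-tested metric of Fourier circuit synchronisation that needs no prior circuit knowledge.
- Shows that FSD synchronisation leads grokking in all nine configurations and precedes the excluded-loss baseline.
- Gives causal evidence, from weight decay forks at the FSD-ceiling step, that the delay follows Delta_t ~ C/lambda.
- Reports architecture ablations on attention-only, MLP-only, and single-layer models.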
Methodology
Please refer to the full paper for detailed methodology.
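As a small check on the statistics quoted in the Overview, the sign-test figure can be reproduced from the counts alone. The sketch below assumes a two-sided exact sign test with a null probability of 0.5; the summary does not say which variant the author used, and the function name `sign_test_p` is illustrative.

```python
from math import comb


def sign_test_p(n_positive: int, n_total: int) -> float:
    """Two-sided exact sign test p-value under H0: P(positive) = 0.5."""
    # Count of the more frequent sign; the test is symmetric under H0.
    k = max(n_positive, n_total - n_positive)
    # Probability of seeing k or more of one sign out of n_total fair coin flips.
    tail = sum(comb(n_total, i) for i in range(k, n_total + 1)) / 2 ** n_total
    # Double the one-sided tail for a two-sided test, capped at 1.
    return min(1.0, 2 * tail)


# Nine configurations, all nine with a positive FSD lead (counts from the Overview).
p_value = sign_test_p(9, 9)
print(round(p_value, 4))  # 0.0039
```

With nine positive leads out of nine, the two-sided value is 2/512, which rounds to the reported 0.004. A one-sided test would give half that.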
Practical Implications
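The abstract points to two practical uses: FSD as an early predictor of grokking that needs no prior circuit knowledge, and weight decay as a control for the delay between circuit synchronisation and generalisation.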
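One way the reported Delta_t ~ C/lambda law could be used in practice is to fit C from a few forked runs and then estimate the delay at an untried weight decay. The following is a minimal sketch, not the author's procedure. The function name `fit_inverse_law` is hypothetical, and the numbers are placeholders generated from an arbitrary C = 1000, not measurements from the paper.

```python
def fit_inverse_law(lambdas, delays):
    """Least-squares fit of delay = C / lambda (a line through the origin in 1/lambda)."""
    xs = [1.0 / lam for lam in lambdas]
    # Closed-form slope for a regression through the origin.
    c = sum(x * y for x, y in zip(xs, delays)) / sum(x * x for x in xs)
    # R^2 measured against the mean of the delays (one common convention).
    mean_y = sum(delays) / len(delays)
    ss_res = sum((y - c * x) ** 2 for x, y in zip(xs, delays))
    ss_tot = sum((y - mean_y) ** 2 for y in delays)
    return c, 1.0 - ss_res / ss_tot


# Placeholder inputs, NOT taken from the paper: delays follow C / lambda with C = 1000.
lambdas = [0.5, 1.0, 2.0]
delays = [2000.0, 1000.0, 500.0]

c_hat, r2 = fit_inverse_law(lambdas, delays)
predicted_delay = c_hat / 4.0  # estimated delay at an untried lambda = 4.0
print(c_hat, r2, predicted_delay)
```

Because the placeholder data follow the law exactly, the fit recovers C and an R^2 of 1. Real forked runs would carry noise, and the fitted C would be expected to differ across primes.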
Authors
- Achyuthan Sivasankar
Paper Information
- arXiv ID: 2606.12966v1
- Categories: cs.LG, cs.NE
- Published: June 11, 2026