[Paper] Sharp Capacity Thresholds in Linear Associative Memory: From Winner-Take-All to Listwise Retrieval
Source: arXiv - 2605.05189v1
Overview
This paper investigates a fundamental question for anyone building fast, memory‑efficient associative lookup systems: how many key‑value pairs can a linear (matrix‑based) memory store and still retrieve them reliably? By analysing two common retrieval strategies—winner‑take‑all (top‑1) and listwise (top‑k) decoding—the authors uncover sharp capacity thresholds and show why the “log‑factor” that appears in many neural‑network memory models is not a loose bound but an intrinsic cost of exact top‑1 retrieval.
Key Contributions
- Exact capacity scaling for top‑1 retrieval: Proves that a (d \times d) linear memory can store at most (n = \Theta!\bigl(d^{2}/\log d\bigr)) associations if the correct item must be the unique highest scorer.
- Sharp phase transition for Correlation Matrix Memory (CMM): Shows that the classic outer‑product storage scheme hits the above bound with a sudden “satisfiable ↔ unsatisfiable” transition.
- Listwise retrieval framework: Introduces the Tail‑Average Margin (TAM) criterion, a convex surrogate that guarantees the true target stays within a controllable candidate list.
- Quadratic capacity under TAM: Demonstrates that when exact top‑1 is relaxed to listwise inclusion, the memory can store a linear number of pairs, i.e. (n = \Theta(d^{2})).
- Variational theory for the empirical‑risk minimizer: Derives a two‑parameter scalar variational principle that predicts performance metrics (score distributions, margins, percentile profiles) across loads ( \alpha = n/d^{2}).
- Ridgeless limit closed‑form critical load: Provides an explicit threshold separating satisfiable and unsatisfiable regimes when regularization vanishes.
- Conjectural refinement of the top‑1 threshold: Using a “small‑tail extrapolation,” the authors conjecture the exact constant in the top‑1 capacity law: (d^{2} \sim 2 n \log n).
Methodology
- Problem setup – The authors model each key‑value pair as a pair of independent isotropic Gaussian vectors in (\mathbb{R}^{d}). The memory is a linear operator (M \in \mathbb{R}^{d \times d}) that stores all pairs simultaneously via superposition (outer‑product sum).
- Retrieval criteria –
- Winner‑take‑all: The correct target must have a higher inner product with the query than any distractor.
- Listwise (TAM): The correct target must belong to the top‑(k) (or a set defined by an average margin over the tail of the score distribution). TAM is a convex surrogate that can be optimized efficiently.
- Statistical‑mechanics analysis – By treating the stored vectors as random and using extreme‑value theory, the authors derive asymptotic expressions for the probability that the correct target beats its strongest competitor.
- Variational principle – For the TAM‑based empirical risk minimizer, they reduce the high‑dimensional optimization to a two‑parameter scalar problem whose solution yields closed‑form predictions for the whole system’s behavior.
- Phase‑transition proof – They rigorously show that, for the CMM construction, the probability of successful retrieval jumps from near‑1 to near‑0 as the load crosses the derived threshold, establishing a sharp capacity bound.
Results & Findings
| Retrieval mode | Capacity scaling (pairs (n) vs. matrix size (d^{2})) | Key theoretical insight |
|---|---|---|
| Top‑1 (winner‑take‑all) | (n = \Theta!\bigl(d^{2}/\log d\bigr)) | The logarithmic factor is the price of beating the maximum distractor (extreme‑value statistics). |
| Listwise (TAM) | (n = \Theta(d^{2})) | Relaxing the criterion to “stay in the top‑k” removes the extreme‑value penalty, yielding a purely quadratic capacity. |
| Ridgeless TAM limit | Critical load (\alpha_{c}=1/2) (exact) | No regularization needed; the system saturates at half the degrees of freedom. |
| Empirical‑risk minimizer predictions | Accurate closed‑form formulas for score distributions, margins, and percentile curves across loads | Validated by extensive simulations; matches observed phase transitions. |
| Conjectured top‑1 constant | (d^{2} \approx 2 n \log n) | Suggests the previously known (\Theta) bound can be sharpened to a precise constant. |
In short, the paper quantifies when a linear associative memory will fail or succeed, and it does so with mathematically tight thresholds rather than loose big‑O statements.
Practical Implications
- Design of fast key‑value caches: Engineers can now size a linear memory (e.g., a learned index or a hardware‑accelerated associative array) with confidence that storing more than (d^{2}/\log d) items will inevitably cause top‑1 lookups to break down.
- Choosing retrieval strategies: If an application can tolerate a short candidate list (e.g., a downstream reranker), switching to a listwise criterion like TAM can double the usable capacity, turning a logarithmic penalty into a pure quadratic one.
- Hardware accelerators: The sharp phase transition means that a modest increase in matrix dimensions can dramatically improve reliability, guiding silicon area allocation for on‑chip associative memories.
- Training of memory‑augmented neural nets: The variational theory provides a principled way to set regularization (ridge) and to predict when the model will start over‑loading its external memory, potentially informing curriculum‑learning schedules.
- Benchmarking and diagnostics: The closed‑form predictions for score distributions give developers a diagnostic tool: by measuring empirical margins they can infer how close they are to the theoretical capacity limit.
Limitations & Future Work
- Gaussian assumption: The analysis relies on isotropic Gaussian keys/values. Real‑world embeddings (e.g., word vectors, image features) often exhibit heavy tails or anisotropy, which may shift the thresholds.
- Linear memory only: Non‑linear or attention‑based retrieval mechanisms are not covered; extending the theory to such architectures remains open.
- Finite‑size effects: While asymptotic results are sharp, practical systems with moderate (d) (e.g., (d=64)–(256)) may experience smoother transitions; empirical calibration is needed.
- TAM hyperparameters: The listwise criterion introduces a tail‑averaging window and margin parameter; optimal settings for specific applications are not fully explored.
- Conjectural constant: The proposed exact top‑1 constant (2) is based on extrapolation; a rigorous proof would solidify the claim.
Future research directions include relaxing distributional assumptions, extending the variational framework to non‑linear memories, and empirically validating the TAM approach on large‑scale retrieval tasks (e.g., neural retrieval, recommendation systems).
Authors
- Nicholas Barnfield
- Juno Kim
- Eshaan Nichani
- Jason D. Lee
- Yue M. Lu
Paper Information
- arXiv ID: 2605.05189v1
- Categories: stat.ML, cs.IT, cs.LG
- Published: May 6, 2026
- PDF: Download PDF