[Paper] Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression
Source: arXiv - 2511.21016v1
Overview
The paper introduces Gated KalmaNet (GKA), a new neural network layer that blends the efficiency of linear state‑space models (SSMs) with the ability to remember the entire input history when generating the next token. By solving a tiny ridge‑regression problem at inference time, GKA retains long‑range context without blowing up memory or compute, making it a practical drop‑in replacement for softmax‑based attention in many language‑model pipelines.
Key Contributions
- Online ridge regression at test time – a constant‑memory, linear‑time algorithm that incorporates the full past sequence into each prediction.
- Adaptive regularization & gating – input‑dependent control of the regression’s condition number, stabilizing the computation in low‑precision (e.g., bfloat16) hardware.
- Chebyshev‑iteration solver – a numerically robust alternative to classic Kalman‑filter updates, well‑suited for modern GPUs/TPUs.
- Chunk‑wise, hardware‑aware implementation – custom kernels that parallelize the iterative solver and back‑propagation efficiently.
- Empirical gains – state‑of‑the‑art performance on short‑context benchmarks and >10 % relative improvement on long‑context Retrieval‑Augmented Generation (RAG) and LongQA tasks up to 128 k tokens.
Methodology
- Problem framing – Treat next‑token prediction as a ridge‑regression problem: given the hidden‑state matrix $H_{1:t}$ of the history and the corresponding target embeddings $y_{1:t}$, solve
  $$\min_{w}\;\|H_{1:t}\,w - y_{1:t}\|^2 + \lambda \|w\|^2 .$$
- Online solution – Instead of recomputing from scratch at each step, GKA updates the solution incrementally using a Kalman‑filter‑style recursion (minimal sketches follow this list).
- Stability tricks
- Adaptive regularization: a small neural gate predicts $\lambda$ from the current input, keeping the regression matrix well‑conditioned.
- Chebyshev iteration: approximates the matrix inverse with a fixed number of cheap matrix‑vector products, avoiding the numerical pitfalls of direct Kalman updates in low‑precision.
- Chunk‑wise processing – The sequence is split into manageable chunks; each chunk runs the Chebyshev iterations in parallel, then passes the updated state to the next chunk, preserving the linear‑time guarantee.
- Training – The whole pipeline (including the gating network and regularization parameters) is differentiable; custom backward kernels propagate gradients through the iterative solver.
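For concreteness, here is a minimal NumPy sketch of the per‑token recursion described in the bullets above: the full history is compressed into a Gram matrix and a cross‑covariance, a small gate predicts an input‑dependent $\lambda$, and the ridge solution is read out from that constant‑size state. The names (`gka_step`, `gate_lambda`) and the sigmoid gate are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def gate_lambda(h, w_gate, b_gate, lam_min=1e-3, lam_max=1.0):
    """Toy input-dependent gate: maps the current hidden state to a
    regularization strength via a sigmoid (illustrative only)."""
    s = 1.0 / (1.0 + np.exp(-(w_gate @ h + b_gate)))
    return lam_min + (lam_max - lam_min) * s

def gka_step(state, h_t, y_t, w_gate, b_gate):
    """One online step of the test-time ridge-regression view.

    state = (S, b) with S = sum_i h_i h_i^T and b = sum_i h_i y_i^T, so the
    entire history is summarized in constant memory regardless of t.
    """
    S, b = state
    S = S + np.outer(h_t, h_t)   # rank-1 Gram-matrix update
    b = b + np.outer(h_t, y_t)   # cross-covariance with the targets
    lam = gate_lambda(h_t, w_gate, b_gate)
    # Direct regularized solve for clarity; the paper's Chebyshev solver
    # (sketched separately) avoids the explicit inverse in low precision.
    w = np.linalg.solve(S + lam * np.eye(len(h_t)), b)
    return (S, b), w.T @ h_t     # updated state and read-out for token t


# Toy usage on random data (dimensions chosen arbitrarily).
rng = np.random.default_rng(0)
n, m = 16, 8                     # hidden and target dimensions
state = (np.zeros((n, n)), np.zeros((n, m)))
w_gate, b_gate = rng.normal(size=n), 0.0
for _ in range(100):
    h_t, y_t = rng.normal(size=n), rng.normal(size=m)
    state, y_pred = gka_step(state, h_t, y_t, w_gate, b_gate)
```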
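A direct solve of the regularized system is fragile in bfloat16 when the Gram matrix is ill‑conditioned, which is why the paper turns to Chebyshev iterations and processes the sequence chunk‑wise. The sketch below illustrates both ideas on the same constant‑size state; the crude eigenvalue bounds, the per‑chunk (rather than per‑token) solve, and the `lam_fn` gate are assumptions for illustration, and the paper's custom forward/backward kernels are omitted.

```python
import numpy as np

def chebyshev_solve(S, B, lam, num_iters=8):
    """Solve (S + lam*I) X = B with Chebyshev iteration: only matrix
    products, no explicit inverse, which is what makes it attractive in
    low-precision arithmetic. The eigenvalue bounds are crude illustrative
    choices: lam <= eig(S + lam*I) <= lam + trace(S) since S is PSD.
    """
    n = S.shape[0]
    A = S + lam * np.eye(n)
    l_min, l_max = lam, lam + np.trace(S)
    d = (l_max + l_min) / 2.0
    c = (l_max - l_min) / 2.0
    X = np.zeros_like(B)
    R = B - A @ X
    P = np.zeros_like(B)
    alpha = 0.0
    for i in range(1, num_iters + 1):
        if i == 1:
            P, alpha = R.copy(), 1.0 / d
        else:
            beta = 0.5 * (c * alpha) ** 2 if i == 2 else (c * alpha / 2.0) ** 2
            alpha = 1.0 / (d - beta / alpha)
            P = R + beta * P
        X = X + alpha * P
        R = R - alpha * (A @ P)
    return X

def gka_chunked(H, Y, lam_fn, chunk_size=256, num_iters=8):
    """Chunk-wise driver: the summary state (S, b) is carried across chunks,
    so memory stays constant in sequence length; within a chunk the Gram
    update is a single batched matrix product. One solve per chunk is a
    simplification of the per-token recursion."""
    T, n = H.shape
    S, b = np.zeros((n, n)), np.zeros((n, Y.shape[1]))
    outputs = []
    for start in range(0, T, chunk_size):
        Hc, Yc = H[start:start + chunk_size], Y[start:start + chunk_size]
        S = S + Hc.T @ Hc                 # batched Gram update
        b = b + Hc.T @ Yc                 # batched cross-covariance update
        lam = lam_fn(Hc)                  # input-dependent regularization
        W = chebyshev_solve(S, b, lam, num_iters)
        outputs.append(Hc @ W)            # read-out for this chunk
    return np.concatenate(outputs, axis=0)
```

With, for example, a stand‑in gate such as `lam_fn = lambda Hc: 0.1`, this runs end to end; in the paper the gate and the iteration count are learned or tuned, and gradients are propagated through the solver with custom backward kernels.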
Results & Findings
| Benchmark | Context Length | Baseline (e.g., Mamba2) | GKA | Relative Gain |
|---|---|---|---|---|
| WikiText‑103 (short) | ≤ 2 k | 78.4 % accuracy | 81.2 % | +3.6 % |
| RAG (retrieval‑augmented generation) | 64 k – 128 k | 62.1 % F1 | 70.0 % | +12.7 % |
| LongQA | 128 k | 55.3 % EM | 63.1 % | +14.3 % |
- Memory & compute stay linear in sequence length (≈ 1.2 × the cost of a vanilla SSM layer).
- Precision robustness: performance remains stable when switching from fp32 to bfloat16, thanks to the adaptive regularization and Chebyshev solver.
- Ablation studies show that removing the gating or using a naïve conjugate‑gradient solver drops long‑context performance by > 6 %.
Practical Implications
- Plug‑and‑play layer: Developers can replace an existing SSM or attention block with GKA without redesigning the model architecture (a minimal sketch follows this list).
- Cost‑effective long‑context models: For applications like document‑level QA, code‑completion over large files, or RAG pipelines, GKA delivers higher recall at a fraction of the memory cost of full attention.
- Low‑precision friendly: Works out‑of‑the‑box on bfloat16‑enabled hardware (TPUs, newer GPUs), enabling faster inference and lower energy consumption.
- Scalable training: The chunk‑wise implementation fits into typical GPU memory budgets, allowing pre‑training or fine‑tuning on sequences up to 128 k tokens with modest hardware.
- Open‑source potential: The authors provide custom kernels; integrating them into popular libraries (e.g., PyTorch, JAX) would let the broader community adopt the technique quickly.
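To make the plug‑and‑play claim concrete, below is a minimal PyTorch‑style sketch of what a drop‑in GKA‑like token mixer could look like. The class name `GatedKalmaNetLayer`, its projections, the per‑chunk gate, and the direct solve (standing in for the Chebyshev iteration) are all assumptions for illustration; the authors' layer and kernels may differ substantially.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedKalmaNetLayer(nn.Module):
    """Hypothetical drop-in token mixer (name and interface assumed, not the
    authors' released code): maps (batch, seq, dim) -> (batch, seq, dim) like
    a self-attention block, but mixes tokens with the chunk-wise test-time
    ridge regression sketched in the Methodology section."""

    def __init__(self, dim, chunk_size=256):
        super().__init__()
        self.feat = nn.Linear(dim, dim)   # features used as regressors
        self.targ = nn.Linear(dim, dim)   # regression targets
        self.gate = nn.Linear(dim, 1)     # predicts per-chunk regularization
        self.out = nn.Linear(dim, dim)
        self.chunk_size = chunk_size

    def forward(self, x):
        bsz, seq, dim = x.shape
        h, y = self.feat(x), self.targ(x)
        gram = x.new_zeros(bsz, dim, dim)
        cross = x.new_zeros(bsz, dim, dim)
        eye = torch.eye(dim, device=x.device, dtype=x.dtype)
        outs = []
        for s in range(0, seq, self.chunk_size):
            hc = h[:, s:s + self.chunk_size]
            yc = y[:, s:s + self.chunk_size]
            gram = gram + hc.transpose(1, 2) @ hc    # carried Gram state
            cross = cross + hc.transpose(1, 2) @ yc  # carried cross-covariance
            lam = F.softplus(self.gate(hc).mean(dim=(1, 2)))  # gated lambda
            # Direct batched solve as a stand-in for the Chebyshev solver.
            W = torch.linalg.solve(gram + lam.view(bsz, 1, 1) * eye, cross)
            outs.append(hc @ W)
        return self.out(torch.cat(outs, dim=1))
```

Swapping `nn.MultiheadAttention` (or an SSM mixer) in a transformer block for a module of this shape is then a one‑line change; the surrounding residual connections and MLP sublayers stay as they are.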
Limitations & Future Work
- Chunk boundary effects: Although mitigated by the iterative solver, very abrupt topic shifts at chunk edges can still cause slight degradation; smarter overlapping strategies are a possible remedy.
- Solver hyper‑parameters: The number of Chebyshev iterations and gating architecture need modest tuning for each new domain, which adds a small engineering overhead.
- Extending beyond language: The paper focuses on NLP tasks; applying GKA to vision or multimodal streams may require additional adaptation.
- Theoretical analysis: A deeper understanding of the trade‑off between regularization strength and memory retention could guide automated gating mechanisms.
Authors
- Liangzu Peng
- Aditya Chattopadhyay
- Luca Zancato
- Elvis Nunez
- Wei Xia
- Stefano Soatto
Paper Information
- arXiv ID: 2511.21016v1
- Categories: cs.LG, cs.CL
- Published: November 26, 2025