[Paper] Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression

Published: November 25, 2025 at 10:26 PM EST
Source: arXiv - 2511.21016v1

Overview

The paper introduces Gated KalmaNet (GKA), a new neural network layer that blends the efficiency of linear state‑space models (SSMs) with the ability to remember the entire input history when generating the next token. By solving a tiny ridge‑regression problem at inference time, GKA retains long‑range context without blowing up memory or compute, making it a practical drop‑in replacement for softmax‑based attention in many language‑model pipelines.

Key Contributions

  • Online ridge regression at test time – a constant‑memory, linear‑time algorithm that incorporates the full past sequence into each prediction (see the sketch after this list).
  • Adaptive regularization & gating – input‑dependent control of the regression’s condition number, stabilizing the computation in low‑precision (e.g., bfloat16) hardware.
  • Chebyshev‑iteration solver – a numerically robust alternative to classic Kalman‑filter updates, well‑suited for modern GPUs/TPUs.
  • Chunk‑wise, hardware‑aware implementation – custom kernels that parallelize the iterative solver and back‑propagation efficiently.
  • Empirical gains – state‑of‑the‑art performance on short‑context benchmarks and >10 % relative improvement on long‑context Retrieval‑Augmented Generation (RAG) and LongQA tasks up to 128 k tokens.
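
The constant‑memory claim in the first bullet boils down to maintaining fixed‑size sufficient statistics instead of the sequence itself. The sketch below is a generic illustration of that idea, not the authors' implementation; the dimensions, the fixed regularizer value, and the direct `np.linalg.solve` call are placeholder choices.

```python
import numpy as np

dim, d_out = 64, 64                    # placeholder sizes
S = np.zeros((dim, dim))               # running H^T H over the whole prefix
c = np.zeros((dim, d_out))             # running H^T Y over the whole prefix
lam = 1e-2                             # fixed ridge strength for this sketch

def step(h, y):
    """Fold one (hidden state, target) pair into the running statistics and
    return the ridge solution over the full history seen so far."""
    global S, c
    S += np.outer(h, h)
    c += np.outer(h, y)
    return np.linalg.solve(S + lam * np.eye(dim), c)

for t in range(1_000):                 # memory stays O(dim^2), independent of t
    w_t = step(np.random.randn(dim), np.random.randn(d_out))
```

Each step costs a fixed amount of work and storage, which is the linear‑time, constant‑memory property the bullet refers to; the paper replaces the direct solve with the adaptive, Chebyshev‑based solver described under Methodology.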

Methodology

  1. Problem framing – Treat next‑token prediction as a ridge‑regression problem: given a hidden‑state matrix \(H_{1:t}\) and target token embeddings \(y_t\), solve
    \[ \min_w \; \|H_{1:t} w - y_t\|^2 + \lambda \|w\|^2 . \]
    (A minimal sketch combining steps 1–4 appears after this list.)
  2. Online solution – Instead of recomputing from scratch each step, GKA updates the solution incrementally using a Kalman‑filter‑style recursion.
  3. Stability tricks
    • Adaptive regularization: a small neural gate predicts \(\lambda\) from the current input, keeping the regression matrix well‑conditioned.
    • Chebyshev iteration: approximates the matrix inverse with a fixed number of cheap matrix‑vector products, avoiding the numerical pitfalls of direct Kalman updates in low‑precision.
  4. Chunk‑wise processing – The sequence is split into manageable chunks; each chunk runs the Chebyshev iterations in parallel, then passes the updated state to the next chunk, preserving the linear‑time guarantee.
  5. Training – The whole pipeline (including the gating network and regularization parameters) is differentiable; custom backward kernels propagate gradients through the iterative solver.
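
To make steps 1–4 concrete, here is a minimal, self‑contained sketch under stated assumptions: a textbook Chebyshev iteration stands in for the authors' solver, a toy variance‑based gate stands in for the learned gating network, and the loose spectral bound `lam + trace(S)` is an illustrative choice. This is not the paper's kernel implementation.

```python
import numpy as np

def cheby_solve(A, b, eig_lo, eig_hi, n_iter=8):
    """Textbook Chebyshev iteration for A x = b, assuming the eigenvalues of A
    lie in [eig_lo, eig_hi]: a fixed number of matrix-vector products and no
    explicit inverse, which is the property that helps in low precision."""
    theta = 0.5 * (eig_hi + eig_lo)
    delta = 0.5 * (eig_hi - eig_lo)
    sigma = theta / delta
    rho = 1.0 / sigma
    x = np.zeros_like(b)
    r = b - A @ x
    d = r / theta
    for _ in range(n_iter):
        x = x + d
        r = r - A @ d
        rho_new = 1.0 / (2.0 * sigma - rho)
        d = rho_new * rho * d + (2.0 * rho_new / delta) * r
        rho = rho_new
    return x

def process_chunk(H, Y, S, c, gate):
    """One chunk: fold the chunk into the running statistics, pick an
    input-dependent ridge strength, and solve for the readout weights."""
    S = S + H.T @ H                    # state carried to the next chunk
    c = c + H.T @ Y
    lam = gate(H)                      # adaptive regularization (toy gate)
    A = S + lam * np.eye(S.shape[0])
    W = cheby_solve(A, c, lam, lam + np.trace(S))  # loose spectral bounds
    return W, S, c

# Toy usage: four chunks processed sequentially, constant per-chunk cost.
rng = np.random.default_rng(0)
dim, d_out, chunk_len = 32, 32, 128
S, c = np.zeros((dim, dim)), np.zeros((dim, d_out))
gate = lambda H: 1e-2 * (1.0 + H.var())            # stand-in for the learned gate
for _ in range(4):
    H = rng.standard_normal((chunk_len, dim))
    Y = rng.standard_normal((chunk_len, d_out))
    W, S, c = process_chunk(H, Y, S, c, gate)
```

The iteration count and the spectral bounds are exactly the kind of solver hyper‑parameters the Limitations section mentions; the paper's contribution is making this loop differentiable and fast through chunk‑wise, hardware‑aware kernels.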

Results & Findings

| Benchmark | Context Length | Baseline (e.g., Mamba2) | GKA | Relative Gain |
|---|---|---|---|---|
| WikiText‑103 (short) | ≤ 2 k | 78.4 % accuracy | 81.2 % | +3.6 % |
| RAG (retrieval‑augmented generation) | 64 k – 128 k | 62.1 % F1 | 70.0 % | +12.7 % |
| LongQA | 128 k | 55.3 % EM | 63.1 % | +14.3 % |
  • Memory & compute stay linear in sequence length (≈ 1.2 × the cost of a vanilla SSM layer).
  • Precision robustness: performance remains stable when switching from fp32 to bfloat16, thanks to the adaptive regularization and Chebyshev solver.
  • Ablation studies show that removing the gating or using a naïve conjugate‑gradient solver drops long‑context performance by > 6 %.

Practical Implications

  • Plug‑and‑play layer: Developers can replace an existing SSM or attention block with GKA without redesigning the model architecture (see the illustrative sketch after this list).
  • Cost‑effective long‑context models: For applications like document‑level QA, code‑completion over large files, or RAG pipelines, GKA delivers higher recall at a fraction of the memory cost of full attention.
  • Low‑precision friendly: Works out‑of‑the‑box on bfloat16‑enabled hardware (TPUs, newer GPUs), enabling faster inference and lower energy consumption.
  • Scalable training: The chunk‑wise implementation fits into typical GPU memory budgets, allowing pre‑training or fine‑tuning on sequences up to 128 k tokens with modest hardware.
  • Open‑source potential: The authors provide custom kernels; integrating them into popular libraries (e.g., PyTorch, JAX) would let the broader community adopt the technique quickly.
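
As an illustration of the plug‑and‑play point, the hypothetical PyTorch module below exposes the same (batch, sequence, dim) interface as a self‑attention block while producing each output from a gated ridge regression over the running prefix. Every name and design detail here (`RidgeMemoryLayer`, the softplus gate, the per‑step `torch.linalg.solve`) is an illustrative stand‑in, not the paper's GKA layer or its custom kernels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RidgeMemoryLayer(nn.Module):
    """Hypothetical drop-in block with an attention-like call signature:
    (batch, seq, dim) -> (batch, seq, dim). Outputs come from a ridge
    regression over running prefix statistics with a gated regularizer."""
    def __init__(self, dim: int, base_lam: float = 1e-2):
        super().__init__()
        self.key = nn.Linear(dim, dim, bias=False)
        self.value = nn.Linear(dim, dim, bias=False)
        self.query = nn.Linear(dim, dim, bias=False)
        self.gate = nn.Linear(dim, 1)          # predicts the ridge strength
        self.base_lam = base_lam

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        k, v, q = self.key(x), self.value(x), self.query(x)
        S = torch.zeros(B, D, D, device=x.device, dtype=x.dtype)
        c = torch.zeros(B, D, D, device=x.device, dtype=x.dtype)
        eye = torch.eye(D, device=x.device, dtype=x.dtype)
        outs = []
        for t in range(T):                     # recurrent form; chunked kernels
            kt, vt = k[:, t], v[:, t]          # would parallelize this loop
            S = S + kt.unsqueeze(-1) * kt.unsqueeze(-2)
            c = c + kt.unsqueeze(-1) * vt.unsqueeze(-2)
            lam = self.base_lam * F.softplus(self.gate(x[:, t]))
            W = torch.linalg.solve(S + lam.unsqueeze(-1) * eye, c)
            outs.append(torch.einsum("bd,bde->be", q[:, t], W))
        return torch.stack(outs, dim=1)

# Shape check: same interface as the attention block it would replace.
layer = RidgeMemoryLayer(dim=16)
print(layer(torch.randn(2, 8, 16)).shape)      # torch.Size([2, 8, 16])
```

In a real model the recurrent loop would be handled by the paper's chunk‑wise kernels; the point here is only that the call signature matches what an attention or SSM block already provides.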

Limitations & Future Work

  • Chunk boundary effects: Although mitigated by the iterative solver, very abrupt topic shifts at chunk edges can still cause slight degradation; smarter overlapping strategies are a possible remedy.
  • Solver hyper‑parameters: The number of Chebyshev iterations and gating architecture need modest tuning for each new domain, which adds a small engineering overhead.
  • Extending beyond language: The paper focuses on NLP tasks; applying GKA to vision or multimodal streams may require additional adaptation.
  • Theoretical analysis: A deeper understanding of the trade‑off between regularization strength and memory retention could guide automated gating mechanisms.

Authors

  • Liangzu Peng
  • Aditya Chattopadhyay
  • Luca Zancato
  • Elvis Nunez
  • Wei Xia
  • Stefano Soatto

Paper Information

  • arXiv ID: 2511.21016v1
  • Categories: cs.LG, cs.CL
  • Published: November 26, 2025