[Paper] Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression
Source: arXiv - 2511.21016v1
Overview
The paper introduces Gated KalmaNet (GKA), a new neural network layer that blends the efficiency of linear state‑space models (SSMs) with the ability to remember the entire input history when generating the next token. By solving a tiny ridge‑regression problem at inference time, GKA retains long‑range context without blowing up memory or compute, making it a practical drop‑in replacement for softmax‑based attention in many language‑model pipelines.
Key Contributions
- Online ridge regression at test time – a constant‑memory, linear‑time algorithm that incorporates the full past sequence into each prediction.
- Adaptive regularization & gating – input‑dependent control of the regression’s condition number, stabilizing the computation in low‑precision (e.g., bfloat16) hardware.
- Chebyshev‑iteration solver – a numerically robust alternative to classic Kalman‑filter updates, well‑suited for modern GPUs/TPUs.
- Chunk‑wise, hardware‑aware implementation – custom kernels that parallelize the iterative solver and back‑propagation efficiently.
- Empirical gains – state‑of‑the‑art performance on short‑context benchmarks and >10 % relative improvement on long‑context Retrieval‑Augmented Generation (RAG) and LongQA tasks up to 128 k tokens.
Methodology
- Problem framing – Treat next‑token prediction as a ridge‑regression problem: given the hidden‑state matrix $H_{1:t}$ of the history and the corresponding target embeddings $y_{1:t}$, solve
  $$\min_{w}\;\|H_{1:t}\,w - y_{1:t}\|^2 + \lambda \|w\|^2 .$$
- Online solution – Instead of recomputing from scratch at each step, GKA updates the solution incrementally using a Kalman‑filter‑style recursion (minimal sketches follow this list).
- Stability tricks
- Adaptive regularization: a small neural gate predicts $\lambda$ from the current input, keeping the regression matrix well‑conditioned.
- Chebyshev iteration: approximates the matrix inverse with a fixed number of cheap matrix‑vector products, avoiding the numerical pitfalls of direct Kalman updates in low‑precision.
- Chunk‑wise processing – The sequence is split into manageable chunks; each chunk runs the Chebyshev iterations in parallel, then passes the updated state to the next chunk, preserving the linear‑time guarantee.
- Training – The whole pipeline (including the gating network and regularization parameters) is differentiable; custom backward kernels propagate gradients through the iterative solver.
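For concreteness, here is a minimal NumPy sketch of the per‑token recursion described in the bullets above: the full history is compressed into a Gram matrix and a cross‑covariance, a small gate predicts an input‑dependent $\lambda$, and the ridge solution is read out from that constant‑size state. The names (`gka_step`, `gate_lambda`) and the sigmoid gate are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def gate_lambda(h, w_gate, b_gate, lam_min=1e-3, lam_max=1.0):
    """Toy input-dependent gate: maps the current hidden state to a
    regularization strength via a sigmoid (illustrative only)."""
    s = 1.0 / (1.0 + np.exp(-(w_gate @ h + b_gate)))
    return lam_min + (lam_max - lam_min) * s

def gka_step(state, h_t, y_t, w_gate, b_gate):
    """One online step of the test-time ridge-regression view.

    state = (S, b) with S = sum_i h_i h_i^T and b = sum_i h_i y_i^T, so the
    entire history is summarized in constant memory regardless of t.
    """
    S, b = state
    S = S + np.outer(h_t, h_t)   # rank-1 Gram-matrix update
    b = b + np.outer(h_t, y_t)   # cross-covariance with the targets
    lam = gate_lambda(h_t, w_gate, b_gate)
    # Direct regularized solve for clarity; the paper's Chebyshev solver
    # (sketched separately) avoids the explicit inverse in low precision.
    w = np.linalg.solve(S + lam * np.eye(len(h_t)), b)
    return (S, b), w.T @ h_t     # updated state and read-out for token t


# Toy usage on random data (dimensions chosen arbitrarily).
rng = np.random.default_rng(0)
n, m = 16, 8                     # hidden and target dimensions
state = (np.zeros((n, n)), np.zeros((n, m)))
w_gate, b_gate = rng.normal(size=n), 0.0
for _ in range(100):
    h_t, y_t = rng.normal(size=n), rng.normal(size=m)
    state, y_pred = gka_step(state, h_t, y_t, w_gate, b_gate)
```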
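A direct solve of the regularized system is fragile in bfloat16 when the Gram matrix is ill‑conditioned, which is why the paper turns to Chebyshev iterations and processes the sequence chunk‑wise. The sketch below illustrates both ideas on the same constant‑size state; the crude eigenvalue bounds, the per‑chunk (rather than per‑token) solve, and the `lam_fn` gate are assumptions for illustration, and the paper's custom forward/backward kernels are omitted.

```python
import numpy as np

def chebyshev_solve(S, B, lam, num_iters=8):
    """Solve (S + lam*I) X = B with Chebyshev iteration: only matrix
    products, no explicit inverse, which is what makes it attractive in
    low-precision arithmetic. The eigenvalue bounds are crude illustrative
    choices: lam <= eig(S + lam*I) <= lam + trace(S) since S is PSD.
    """
    n = S.shape[0]
    A = S + lam * np.eye(n)
    l_min, l_max = lam, lam + np.trace(S)
    d = (l_max + l_min) / 2.0
    c = (l_max - l_min) / 2.0
    X = np.zeros_like(B)
    R = B - A @ X
    P = np.zeros_like(B)
    alpha = 0.0
    for i in range(1, num_iters + 1):
        if i == 1:
            P, alpha = R.copy(), 1.0 / d
        else:
            beta = 0.5 * (c * alpha) ** 2 if i == 2 else (c * alpha / 2.0) ** 2
            alpha = 1.0 / (d - beta / alpha)
            P = R + beta * P
        X = X + alpha * P
        R = R - alpha * (A @ P)
    return X

def gka_chunked(H, Y, lam_fn, chunk_size=256, num_iters=8):
    """Chunk-wise driver: the summary state (S, b) is carried across chunks,
    so memory stays constant in sequence length; within a chunk the Gram
    update is a single batched matrix product. One solve per chunk is a
    simplification of the per-token recursion."""
    T, n = H.shape
    S, b = np.zeros((n, n)), np.zeros((n, Y.shape[1]))
    outputs = []
    for start in range(0, T, chunk_size):
        Hc, Yc = H[start:start + chunk_size], Y[start:start + chunk_size]
        S = S + Hc.T @ Hc                 # batched Gram update
        b = b + Hc.T @ Yc                 # batched cross-covariance update
        lam = lam_fn(Hc)                  # input-dependent regularization
        W = chebyshev_solve(S, b, lam, num_iters)
        outputs.append(Hc @ W)            # read-out for this chunk
    return np.concatenate(outputs, axis=0)
```

With, for example, a stand‑in gate such as `lam_fn = lambda Hc: 0.1`, this runs end to end; in the paper the gate and the iteration count are learned or tuned, and gradients are propagated through the solver with custom backward kernels.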
Results & Findings
| Benchmark | Context Length | Baseline (e.g., Mamba2) | GKA | Relative Gain |
|---|---|---|---|---|
| WikiText‑103 (short) | ≤ 2 k | 78.4 % accuracy | 81.2 % | +3.6 % |
| RAG (retrieval‑augmented generation) | 64 k – 128 k | 62.1 % F1 | 70.0 % | +12.7 % |
| LongQA | 128 k | 55.3 % EM | 63.1 % | +14.3 % |
- Memory & compute stay linear in sequence length (≈ 1.2 × the cost of a vanilla SSM layer).
- Precision robustness: performance remains stable when switching from fp32 to bfloat16, thanks to the adaptive regularization and Chebyshev solver.
- Ablation studies show that removing the gating or using a naïve conjugate‑gradient solver drops long‑context performance by > 6 %.
Practical Implications
- Plug‑and‑play layer: Developers can replace an existing SSM or attention block with GKA without redesigning the model architecture (a minimal sketch follows this list).
- Cost‑effective long‑context models: For applications like document‑level QA, code‑completion over large files, or RAG pipelines, GKA delivers higher recall at a fraction of the memory cost of full attention.
- Low‑precision friendly: Works out‑of‑the‑box on bfloat16‑enabled hardware (TPUs, newer GPUs), enabling faster inference and lower energy consumption.
- Scalable training: The chunk‑wise implementation fits into typical GPU memory budgets, allowing pre‑training or fine‑tuning on sequences up to 128 k tokens with modest hardware.
- Open‑source potential: The authors provide custom kernels; integrating them into popular libraries (e.g., PyTorch, JAX) would let the broader community adopt the technique quickly.
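To make the plug‑and‑play claim concrete, below is a minimal PyTorch‑style sketch of what a drop‑in GKA‑like token mixer could look like. The class name `GatedKalmaNetLayer`, its projections, the per‑chunk gate, and the direct solve (standing in for the Chebyshev iteration) are all assumptions for illustration; the authors' layer and kernels may differ substantially.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedKalmaNetLayer(nn.Module):
    """Hypothetical drop-in token mixer (name and interface assumed, not the
    authors' released code): maps (batch, seq, dim) -> (batch, seq, dim) like
    a self-attention block, but mixes tokens with the chunk-wise test-time
    ridge regression sketched in the Methodology section."""

    def __init__(self, dim, chunk_size=256):
        super().__init__()
        self.feat = nn.Linear(dim, dim)   # features used as regressors
        self.targ = nn.Linear(dim, dim)   # regression targets
        self.gate = nn.Linear(dim, 1)     # predicts per-chunk regularization
        self.out = nn.Linear(dim, dim)
        self.chunk_size = chunk_size

    def forward(self, x):
        bsz, seq, dim = x.shape
        h, y = self.feat(x), self.targ(x)
        gram = x.new_zeros(bsz, dim, dim)
        cross = x.new_zeros(bsz, dim, dim)
        eye = torch.eye(dim, device=x.device, dtype=x.dtype)
        outs = []
        for s in range(0, seq, self.chunk_size):
            hc = h[:, s:s + self.chunk_size]
            yc = y[:, s:s + self.chunk_size]
            gram = gram + hc.transpose(1, 2) @ hc    # carried Gram state
            cross = cross + hc.transpose(1, 2) @ yc  # carried cross-covariance
            lam = F.softplus(self.gate(hc).mean(dim=(1, 2)))  # gated lambda
            # Direct batched solve as a stand-in for the Chebyshev solver.
            W = torch.linalg.solve(gram + lam.view(bsz, 1, 1) * eye, cross)
            outs.append(hc @ W)
        return self.out(torch.cat(outs, dim=1))
```

Swapping `nn.MultiheadAttention` (or an SSM mixer) in a transformer block for a module of this shape is then a one‑line change; the surrounding residual connections and MLP sublayers stay as they are.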
Limitations & Future Work
- Chunk boundary effects: Although mitigated by the iterative solver, very abrupt topic shifts at chunk edges can still cause slight degradation; smarter overlapping strategies are a possible remedy.
- Solver hyper‑parameters: The number of Chebyshev iterations and gating architecture need modest tuning for each new domain, which adds a small engineering overhead.
- Extending beyond language: The paper focuses on NLP tasks; applying GKA to vision or multimodal streams may require additional adaptation.
- Theoretical analysis: A deeper understanding of the trade‑off between regularization strength and memory retention could guide automated gating mechanisms.
Authors
- Liangzu Peng
- Aditya Chattopadhyay
- Luca Zancato
- Elvis Nunez
- Wei Xia
- Stefano Soatto
Paper Information
- arXiv ID: 2511.21016v1
- Categories: cs.LG, cs.CL
- Published: November 26, 2025