Nvidia’s new technique cuts LLM reasoning costs by 8x without losing accuracy
Source: VentureBeat
Dynamic Memory Sparsification (DMS)
What DMS Does
- Compresses the KV cache – the temporary key‑value memory that LLMs generate while processing prompts and reasoning through problems or documents.
- Discards redundant cache entries while preserving (and sometimes even improving) the model’s reasoning performance.
Why It Matters
- Longer “thinking” time – LLMs can explore more solution paths without hitting memory limits.
- No speed penalty – the compression is efficient enough that inference speed remains unchanged.
Key Takeaway
DMS shows that substantial memory savings are possible without degrading model intelligence, addressing a major bottleneck in scaling LLM reasoning.
Reference
- Paper: Dynamic Memory Sparsification – arXiv:2506.05345
The Bottleneck of Reasoning
LLMs improve their performance on complex tasks by generating chain‑of‑thought tokens—essentially writing out their reasoning steps before arriving at a final answer. Inference‑time scaling techniques leverage this by giving the model a larger budget to generate these “thinking” tokens or to explore multiple potential reasoning paths in parallel.
Why longer reasoning hurts performance
- As the model generates more tokens, it builds up a key‑value (KV) cache.
- The KV cache grows linearly with the length of the reasoning chain, consuming large amounts of GPU memory.
- When memory pressure rises, the hardware spends more time reading data from memory than actually computing, which:
  - Slows down generation and increases latency.
  - Caps the number of concurrent users—running out of VRAM can crash the system or degrade it to a crawl.
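The linear growth is easy to quantify with a back-of-the-envelope estimator. In the sketch below, the model dimensions are illustrative (loosely modeled on an 8B-class model with grouped-query attention), not figures from the article:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size=1, bytes_per_elem=2):
    """Bytes needed to hold keys and values across all layers.

    The leading factor of 2 covers the separate key and value tensors;
    bytes_per_elem=2 assumes fp16/bf16 storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# Illustrative config: 32 layers, 8 KV heads of dim 128.
# A single 32k-token reasoning chain already costs ~3.9 GiB,
# and the cost doubles every time the chain doubles.
cache_gib = kv_cache_bytes(32, 8, 128, 32_000) / 2**30
```

Because the footprint scales linearly in `seq_len`, every extra "thinking" token a reasoning model emits permanently claims more VRAM until the request finishes.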
“The question isn’t just about hardware quantity; it’s about whether your infrastructure is processing 100 reasoning threads or 800 threads for the same cost.”
— Piotr Nawrot, Senior Deep Learning Engineer, Nvidia (as quoted by VentureBeat)
Prior attempts to mitigate the issue
| Approach | How it works | Drawbacks |
|---|---|---|
| Heuristic‑based eviction (e.g., sliding‑window) | Keeps only the most recent tokens in the KV cache, discarding older ones. | May delete critical information, hurting accuracy. |
| Standard eviction heuristics | Selects “old and unused” tokens for removal based on simple rules. | Relies on approximations of the model’s internal mechanics; can lead to incorrect answers. |
| Paging to slower memory | Offloads unused KV cache portions to host RAM or SSD. | Constant swapping introduces latency, making real‑time applications sluggish. |
References
- Chain‑of‑thought – VentureBeat: Don’t believe reasoning models? Chains of thought says Anthropic
- KV cache – VentureBeat: Mixture of Recursions delivers 2× faster inference – here’s how to implement it
Detailed Overview of Dynamic Memory Sparsification (DMS)
Dynamic Memory Sparsification (DMS) retrofits existing large language models (LLMs) so they can intelligently manage their own memory. Instead of applying a fixed rule for token deletion, DMS trains the model to recognize which tokens are essential for future reasoning and which are disposable.
“It doesn’t just guess importance; it learns a policy that explicitly preserves the model’s final output distribution,” — Nawrot
How DMS Works
| Step | Description |
|---|---|
| 1️⃣ Model selection | Start with a standard, pre‑trained LLM (e.g., Llama 3, Qwen 3). |
| 2️⃣ Freeze weights | Freeze the bulk of the model’s parameters (similar to LoRA) to keep training cheap. |
| 3️⃣ Add “keep/evict” heads | Repurpose neurons in the attention layers to output a binary signal for each token: keep or evict. |
| 4️⃣ Train a lightweight policy | Run a short fine‑tuning (≈ 1 000 steps) so the model learns a policy that predicts token importance. |
| 5️⃣ Deploy | The resulting model uses standard kernels and can be dropped into existing inference stacks without custom hardware. |
Key point: The process does not require training the model from scratch, which would be prohibitively expensive.
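The keep/evict heads in step 3 can be pictured as a tiny binary classifier over each token's hidden state. The sketch below is a shape-level illustration only: the probe weights `w`, `b` and the plain-Python math are stand-ins, whereas actual DMS repurposes existing attention-layer neurons and trains the policy end to end:

```python
import math

def keep_or_evict(hidden_state, w, b, threshold=0.5):
    """Hypothetical per-token head: a learned linear probe on the
    token's hidden state, squashed through a sigmoid, yields a
    keep probability. Tokens below the threshold are candidates
    for eviction from the KV cache."""
    logit = sum(h * wi for h, wi in zip(hidden_state, w)) + b
    p_keep = 1.0 / (1.0 + math.exp(-logit))
    return p_keep >= threshold, p_keep
```

During the short fine-tuning phase, only parameters like `w` and `b` would be trained while the bulk of the model stays frozen, which is why the retrofit is cheap.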
Delayed Eviction
Standard sparsification deletes a token the moment it is deemed unimportant, which can be risky because the model may still need a brief window to integrate that token’s context. DMS introduces delayed eviction:
- Flag a token for removal.
- Retain it in a short‑lived buffer (a few hundred steps).
- Allow the model to extract any remaining useful information and merge it into the current context.
- Evict the token from the KV cache after the window expires.
“The ‘delayed eviction’ mechanism is crucial because not all tokens are simply ‘important’ (keep forever) or ‘useless’ (delete immediately). Many fall in between — they carry some information, but not enough to justify occupying an entire slot in memory,” — Nawrot.
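The buffering logic above can be sketched with a made-up `DelayedEvictionCache` class. The `window` length and the dict-of-tokens representation are illustrative; real DMS operates on attention KV tensors inside the model:

```python
from collections import deque

class DelayedEvictionCache:
    """Toy KV cache with a grace window: tokens flagged for eviction
    stay readable for `window` further generation steps before being
    dropped, giving the model time to absorb their context."""

    def __init__(self, window=4):
        self.window = window
        self.live = {}          # token_id -> KV payload, still readable
        self.pending = deque()  # (evict_at_step, token_id), FIFO by step
        self.step = 0

    def add(self, token_id, kv):
        self.live[token_id] = kv

    def flag(self, token_id):
        # Mark for removal, but keep it alive until the window expires.
        self.pending.append((self.step + self.window, token_id))

    def tick(self):
        # Advance one generation step; evict tokens whose window is up.
        self.step += 1
        while self.pending and self.pending[0][0] <= self.step:
            _, tid = self.pending.popleft()
            self.live.pop(tid, None)
```

A flagged token thus occupies memory only briefly after being deemed disposable, rather than forever (pure retention) or not at all (immediate eviction).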
Performance Highlights
- Training cost: ~1 000 steps of fine‑tuning (a tiny fraction of the original pre‑training compute).
- Retrofit time: a Qwen‑3‑8B model can be retrofitted on a single DGX H100 in a matter of hours.
- Compatibility: Uses standard kernels; no custom hardware or extensive software rewrites required.
Takeaway
DMS offers a lightweight, retrofittable solution for extending the context window of existing LLMs. By learning a token‑importance policy and employing delayed eviction, it preserves crucial information while freeing up memory, all without the massive cost of training a new model from scratch.
DMS in Action
To validate the technique, the researchers applied Dynamic Memory Sparsification (DMS) to several reasoning models, including the Qwen‑R1 series (distilled from DeepSeek R1) and Llama 3.2. They evaluated the models on challenging benchmarks such as AIME 24 (math), GPQA Diamond (science), and LiveCodeBench (coding).
Key Findings
| Benchmark | Model (with DMS) | Baseline (no DMS) | Δ Score / Throughput |
|---|---|---|---|
| AIME 24 (math) | Qwen‑R1 32B | Standard Qwen‑R1 32B (same memory‑bandwidth budget) | +12.0 points |
| Needle‑in‑a‑Haystack (long‑context retrieval) | DMS‑enabled variants | Standard models | Higher retrieval accuracy |
| Enterprise throughput (Qwen‑3 8B) | DMS‑enabled | Vanilla Qwen‑3 8B | ≈ 5× higher throughput (same accuracy) |
How DMS Helps
- Deeper & wider reasoning: By compressing the cache, the model can “think” more extensively within the same memory and compute budget.
- Cleaner context: Active memory management prevents the accumulation of noisy tokens, which benefits long‑context tasks.
- Hardware efficiency: A smaller memory cache reduces GPU fetch latency, translating into faster query handling and lower hardware costs.
Implications for Enterprise Deployments
- Throughput boost: A single server can handle up to 5× more queries per second without sacrificing quality.
- Cost savings: Reduced memory bandwidth and GPU idle time lower operational expenses.
- Scalability: Smaller cache footprints enable higher model density per GPU, facilitating larger deployments on existing hardware.
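The density claim reduces to simple arithmetic. In the hypothetical helper below, the GPU and model sizes are illustrative rather than figures from the article; only the compression factor echoes DMS's reported 8×:

```python
def max_concurrent_sequences(vram_gb, model_gb,
                             per_seq_cache_gb, compression=1.0):
    """How many sequences fit on one GPU once the weights are loaded,
    given a per-sequence KV-cache footprint and a cache compression
    factor (e.g. 8.0 for an 8x-compressed cache)."""
    free = vram_gb - model_gb
    return int(free // (per_seq_cache_gb / compression))

# Illustrative: 80 GB GPU, 16 GB of weights, 4 GB of cache per sequence.
baseline = max_concurrent_sequences(80, 16, 4)        # 16 sequences
with_dms = max_concurrent_sequences(80, 16, 4, 8.0)   # 128 sequences
```

The 8× jump in concurrent sequences under these toy numbers mirrors Nawrot's framing of "100 reasoning threads or 800 threads for the same cost."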
Overall, DMS demonstrates that intelligent memory management can deliver substantial gains in both model performance and system efficiency, challenging the conventional belief that compression inevitably harms long‑context understanding.
The Future of Memory
Nvidia has released DMS as part of its KV‑Press library. Regarding how enterprises can get started with DMS, Nawrot emphasized that the barrier to entry is low:
“The minimum viable infrastructure is standard Hugging Face pipelines — no custom CUDA kernels are required,”
— Nawrot, noting that the code is fully compatible with standard FlashAttention.
Key Takeaways
- Low entry barrier – Use existing Hugging Face pipelines; no need for custom CUDA kernels.
- Compatibility – Works out‑of‑the‑box with FlashAttention and newer architectures such as the Multi‑Head Latent Attention (MLA) used in DeepSeek’s models.
- Future vision – DMS is seen as a distinct, intelligent layer of the AI stack, enabling more efficient memory management.
Looking Ahead
- Integration with MLA – Combining DMS with MLA could yield even greater efficiency gains.
- Scaling agentic systems – As enterprises shift from simple chatbots to complex, reasoning‑heavy agents, inference cost becomes a primary concern.
- Sustainable scaling – Techniques like DMS provide a path to scale these capabilities sustainably.
“We’ve barely scratched the surface of what is possible,” Nawrot said. “We expect inference‑time scaling to further evolve.”