Nvidia’s new technique cuts LLM reasoning costs by 8x without losing accuracy
Source: VentureBeat
Dynamic Memory Sparsification (DMS)
What DMS Does
- Compresses the KV cache – the temporary key‑value memory that LLMs generate while processing prompts and reasoning through problems or documents.
- Discards redundant cache entries while preserving (and sometimes even improving) the model’s reasoning performance.
Why It Matters
- Longer “thinking” time – LLMs can explore more solution paths without hitting memory limits.
- No speed penalty – the compression is efficient enough that inference speed remains unchanged.
Key Takeaway
DMS shows that substantial memory savings are possible without degrading model intelligence, addressing a major bottleneck in scaling LLM reasoning.
Reference
- Paper: Dynamic Memory Sparsification – arXiv:2506.05345
The Bottleneck of Reasoning
LLMs improve their performance on complex tasks by generating chain‑of‑thought tokens—essentially writing out their reasoning steps before arriving at a final answer. Inference‑time scaling techniques leverage this by giving the model a larger budget to generate these “thinking” tokens or to explore multiple potential reasoning paths in parallel.
Why longer reasoning hurts performance
- As the model generates more tokens, it builds up a key‑value (KV) cache.
- The KV cache grows linearly with the length of the reasoning chain, consuming large amounts of GPU memory.
- When memory pressure rises, the hardware spends more time reading data from memory than actually computing, which:
  - Slows down generation and increases latency.
  - Caps the number of concurrent users—running out of VRAM can crash the system or degrade it to a crawl.
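The linear growth is easy to quantify with a back-of-the-envelope estimator. In the sketch below, the model dimensions are illustrative (loosely modeled on an 8B-class model with grouped-query attention), not figures from the article:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size=1, bytes_per_elem=2):
    """Bytes needed to hold keys and values across all layers.

    The leading factor of 2 covers the separate key and value tensors;
    bytes_per_elem=2 assumes fp16/bf16 storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# Illustrative config: 32 layers, 8 KV heads of dim 128.
# A single 32k-token reasoning chain already costs ~3.9 GiB,
# and the cost doubles every time the chain doubles.
cache_gib = kv_cache_bytes(32, 8, 128, 32_000) / 2**30
```

Because the footprint scales linearly in `seq_len`, every extra "thinking" token a reasoning model emits permanently claims more VRAM until the request finishes.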
“The question isn’t just about hardware quantity; it’s about whether your infrastructure is processing 100 reasoning threads or 800 threads for the same cost.”
— Piotr Nawrot, Senior Deep Learning Engineer, Nvidia (as quoted by VentureBeat)
Prior attempts to mitigate the issue
| Approach | How it works | Drawbacks |
|---|---|---|
| Heuristic‑based eviction (e.g., sliding‑window) | Keeps only the most recent tokens in the KV cache, discarding older ones. | May delete critical information, hurting accuracy. |
| Standard eviction heuristics | Selects “old and unused” tokens for removal based on simple rules. | Relies on approximations of the model’s internal mechanics; can lead to incorrect answers. |
| Paging to slower memory | Offloads unused KV cache portions to host RAM or SSD. | Constant swapping introduces latency, making real‑time applications sluggish. |
References
- Chain‑of‑thought – VentureBeat: Don’t believe reasoning models? Chains of thought says Anthropic
- KV cache – VentureBeat: Mixture of Recursions delivers 2× faster inference – here’s how to implement it
Detailed Overview of Dynamic Memory Sparsification (DMS)
Dynamic Memory Sparsification (DMS) retrofits existing large language models (LLMs) so they can intelligently manage their own memory. Instead of applying a fixed rule for token deletion, DMS trains the model to recognize which tokens are essential for future reasoning and which are disposable.
“It doesn’t just guess importance; it learns a policy that explicitly preserves the model’s final output distribution,” — Nawrot
How DMS Works
| Step | Description |
|---|---|
| 1️⃣ Model selection | Start with a standard, pre‑trained LLM (e.g., Llama 3, Qwen 3). |
| 2️⃣ Freeze weights | Freeze the bulk of the model’s parameters (similar to LoRA) to keep training cheap. |
| 3️⃣ Add “keep/evict” heads | Repurpose neurons in the attention layers to output a binary signal for each token: keep or evict. |
| 4️⃣ Train a lightweight policy | Run a short fine‑tuning (≈ 1 000 steps) so the model learns a policy that predicts token importance. |
| 5️⃣ Deploy | The resulting model uses standard kernels and can be dropped into existing inference stacks without custom hardware. |
Key point: The process does not require training the model from scratch, which would be prohibitively expensive.
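The keep/evict heads in step 3 can be pictured as a tiny binary classifier over each token's hidden state. The sketch below is a shape-level illustration only: the probe weights `w`, `b` and the plain-Python math are stand-ins, whereas actual DMS repurposes existing attention-layer neurons and trains the policy end to end:

```python
import math

def keep_or_evict(hidden_state, w, b, threshold=0.5):
    """Hypothetical per-token head: a learned linear probe on the
    token's hidden state, squashed through a sigmoid, yields a
    keep probability. Tokens below the threshold are candidates
    for eviction from the KV cache."""
    logit = sum(h * wi for h, wi in zip(hidden_state, w)) + b
    p_keep = 1.0 / (1.0 + math.exp(-logit))
    return p_keep >= threshold, p_keep
```

During the short fine-tuning phase, only parameters like `w` and `b` would be trained while the bulk of the model stays frozen, which is why the retrofit is cheap.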
Delayed Eviction
Standard sparsification deletes a token the moment it is deemed unimportant, which can be risky because the model may still need a brief window to integrate that token’s context. DMS introduces delayed eviction:
- Flag a token for removal.
- Retain it in a short‑lived buffer (a few hundred steps).
- Allow the model to extract any remaining useful information and merge it into the current context.
- Evict the token from the KV cache after the window expires.
“The ‘delayed eviction’ mechanism is crucial because not all tokens are simply ‘important’ (keep forever) or ‘useless’ (delete immediately). Many fall in between — they carry some information, but not enough to justify occupying an entire slot in memory,” — Nawrot.
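The buffering logic above can be sketched with a made-up `DelayedEvictionCache` class. The `window` length and the dict-of-tokens representation are illustrative; real DMS operates on attention KV tensors inside the model:

```python
from collections import deque

class DelayedEvictionCache:
    """Toy KV cache with a grace window: tokens flagged for eviction
    stay readable for `window` further generation steps before being
    dropped, giving the model time to absorb their context."""

    def __init__(self, window=4):
        self.window = window
        self.live = {}          # token_id -> KV payload, still readable
        self.pending = deque()  # (evict_at_step, token_id), FIFO by step
        self.step = 0

    def add(self, token_id, kv):
        self.live[token_id] = kv

    def flag(self, token_id):
        # Mark for removal, but keep it alive until the window expires.
        self.pending.append((self.step + self.window, token_id))

    def tick(self):
        # Advance one generation step; evict tokens whose window is up.
        self.step += 1
        while self.pending and self.pending[0][0] <= self.step:
            _, tid = self.pending.popleft()
            self.live.pop(tid, None)
```

A flagged token thus occupies memory only briefly after being deemed disposable, rather than forever (pure retention) or not at all (immediate eviction).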
Performance Highlights
- Training cost: ~1 000 steps of fine‑tuning (a tiny fraction of the original pre‑training compute).
- Retrofit time: a Qwen‑3‑8B model can be retrofitted on a single DGX H100 in a matter of hours.
- Compatibility: Uses standard kernels; no custom hardware or extensive software rewrites required.
Takeaway
DMS offers a lightweight, retrofittable solution for extending the context window of existing LLMs. By learning a token‑importance policy and employing delayed eviction, it preserves crucial information while freeing up memory, all without the massive cost of training a new model from scratch.
DMS in Action
To validate the technique, the researchers applied Dynamic Memory Sparsification (DMS) to several reasoning models, including the Qwen‑R1 series (distilled from DeepSeek R1) and Llama 3.2. They evaluated the models on challenging benchmarks such as AIME 24 (math), GPQA Diamond (science), and LiveCodeBench (coding).
Key Findings
| Benchmark | Model (with DMS) | Baseline (no DMS) | Δ Score / Throughput |
|---|---|---|---|
| AIME 24 (math) | Qwen‑R1 32B | Standard Qwen‑R1 32B (same memory‑bandwidth budget) | +12.0 points |
| Needle‑in‑a‑Haystack (long‑context retrieval) | DMS‑enabled variants | Standard models | Higher retrieval accuracy |
| Enterprise throughput (Qwen‑3 8B) | DMS‑enabled | Vanilla Qwen‑3 8B | ≈ 5× higher throughput (same accuracy) |
How DMS Helps
- Deeper & wider reasoning: By compressing the cache, the model can “think” more extensively within the same memory and compute budget.
- Cleaner context: Active memory management prevents the accumulation of noisy tokens, which benefits long‑context tasks.
- Hardware efficiency: A smaller memory cache reduces GPU fetch latency, translating into faster query handling and lower hardware costs.
Implications for Enterprise Deployments
- Throughput boost: A single server can handle up to 5× more queries per second without sacrificing quality.
- Cost savings: Reduced memory bandwidth and GPU idle time lower operational expenses.
- Scalability: Smaller cache footprints enable higher model density per GPU, facilitating larger deployments on existing hardware.
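The density claim reduces to simple arithmetic. In the hypothetical helper below, the GPU and model sizes are illustrative rather than figures from the article; only the compression factor echoes DMS's reported 8×:

```python
def max_concurrent_sequences(vram_gb, model_gb,
                             per_seq_cache_gb, compression=1.0):
    """How many sequences fit on one GPU once the weights are loaded,
    given a per-sequence KV-cache footprint and a cache compression
    factor (e.g. 8.0 for an 8x-compressed cache)."""
    free = vram_gb - model_gb
    return int(free // (per_seq_cache_gb / compression))

# Illustrative: 80 GB GPU, 16 GB of weights, 4 GB of cache per sequence.
baseline = max_concurrent_sequences(80, 16, 4)        # 16 sequences
with_dms = max_concurrent_sequences(80, 16, 4, 8.0)   # 128 sequences
```

The 8× jump in concurrent sequences under these toy numbers mirrors Nawrot's framing of "100 reasoning threads or 800 threads for the same cost."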
Overall, DMS demonstrates that intelligent memory management can deliver substantial gains in both model performance and system efficiency, challenging the conventional belief that compression inevitably harms long‑context understanding.
The Future of Memory
Nvidia has released DMS as part of its KV‑Press library. Regarding how enterprises can get started with DMS, Nawrot emphasized that the barrier to entry is low:
“The minimum viable infrastructure is standard Hugging Face pipelines — no custom CUDA kernels are required,”
— Nawrot, noting that the code is fully compatible with standard FlashAttention.
Key Takeaways
- Low entry barrier – Use existing Hugging Face pipelines; no need for custom CUDA kernels.
- Compatibility – Works out‑of‑the‑box with FlashAttention and newer architectures such as the Multi‑Head Latent Attention (MLA) used in DeepSeek’s models.
- Future vision – DMS is seen as a distinct, intelligent layer of the AI stack, enabling more efficient memory management.
Looking Ahead
- Integration with MLA – Combining DMS with MLA could yield even greater efficiency gains.
- Scaling agentic systems – As enterprises shift from simple chatbots to complex, reasoning‑heavy agents, inference cost becomes a primary concern.
- Sustainable scaling – Techniques like DMS provide a path to scale these capabilities sustainably.
“We’ve barely scratched the surface of what is possible,” Nawrot said. “We expect inference‑time scaling to further evolve.”