[Paper] Taming the Memory Footprint Crisis: System Design for Production Diffusion LLM Serving

Published: December 18, 2025 at 04:18 PM EST
3 min read
Source: arXiv - 2512.17077v1

Overview

Diffusion Large Language Models (dLLMs) promise faster, parallel text generation compared with traditional autoregressive models, but they hit a “memory footprint crisis” when deployed at scale. The paper introduces dLLM‑Serve, a production‑ready serving system that tames the memory spikes and uneven compute‑bandwidth demands of diffusion inference, delivering higher throughput and lower tail latency on both consumer‑grade and server‑grade GPUs.

Key Contributions

  • Memory‑aware tensor decomposition – Logit‑Aware Activation Budgeting breaks down massive, short‑lived logit tensors into smaller pieces that fit comfortably in GPU memory.
  • Phase‑aware scheduling – The Phase‑Multiplexed Scheduler interleaves the compute‑heavy “Refresh” phase with the bandwidth‑bound “Reuse” phase across multiple requests, smoothing resource usage (see the sketch after this list).
  • Sparse attention redesign – Head‑Centric Sparse Attention separates logical sparsity (which heads attend to which tokens) from physical memory layout, enabling efficient storage and retrieval.
  • End‑to‑end system prototype – All three techniques are integrated into a unified serving stack, dLLM‑Serve, with the code released for reproducibility.
  • Comprehensive evaluation – Experiments show 1.6×–1.8× throughput gains and up to 4× lower tail latency on RTX 4090 and NVIDIA L40S GPUs across realistic workloads (LiveBench, Burst, OSC).
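
A rough sketch of the phase‑multiplexing idea from the second bullet above: keep one queue per phase and draw batches alternately so compute‑bound and bandwidth‑bound work overlap across requests. This is not the paper's scheduler – the Request and Phase types, the batch sizes, and the backlog‑based selection rule are illustrative assumptions.

```python
from collections import deque
from dataclasses import dataclass
from enum import Enum

class Phase(Enum):
    REFRESH = "refresh"   # compute-bound: recompute the diffusion state
    REUSE = "reuse"       # bandwidth-bound: reuse cached activations

@dataclass
class Request:
    req_id: int
    phase: Phase

class PhaseMultiplexedScheduler:
    """Toy scheduler: one queue per phase, batches drawn alternately so that
    neither the compute units nor the memory bus sit idle for long."""

    def __init__(self, refresh_batch: int = 4, reuse_batch: int = 16):
        self.queues = {Phase.REFRESH: deque(), Phase.REUSE: deque()}
        self.batch_size = {Phase.REFRESH: refresh_batch, Phase.REUSE: reuse_batch}

    def submit(self, req: Request) -> None:
        self.queues[req.phase].append(req)

    def next_batch(self) -> tuple[Phase, list[Request]]:
        # Pick the phase with the largest backlog relative to its batch size.
        phase = max(self.queues, key=lambda p: len(self.queues[p]) / self.batch_size[p])
        queue, size = self.queues[phase], self.batch_size[phase]
        return phase, [queue.popleft() for _ in range(min(size, len(queue)))]

if __name__ == "__main__":
    sched = PhaseMultiplexedScheduler()
    for i in range(6):
        sched.submit(Request(i, Phase.REFRESH if i % 2 else Phase.REUSE))
    print(sched.next_batch())   # e.g. (Phase.REFRESH, [requests 1, 3, 5])
```

The real system presumably also folds memory‑budget and fairness constraints into this decision; the sketch only shows the per‑phase batching that lets the two resource profiles overlap.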

Methodology

  1. Profiling the diffusion pipeline – The authors instrumented a reference dLLM implementation to expose two distinct phases:
    • Refresh: recompute the diffusion state (compute‑bound).
    • Reuse: reuse previously computed activations to generate the next token (bandwidth‑bound).
  2. Logit‑Aware Activation Budgeting – Instead of allocating a single monolithic buffer for the entire logit tensor, the system predicts the peak activation size per head and dynamically partitions memory, releasing buffers as soon as a phase finishes (a minimal sketch follows this list).
  3. Phase‑Multiplexed Scheduler – Requests are queued per phase. The scheduler packs together multiple “Refresh” tasks followed by a batch of “Reuse” tasks, ensuring the GPU’s compute units stay busy while the memory bus is not saturated.
  4. Head‑Centric Sparse Attention – The attention matrix is stored per head with a compact index that maps logical sparsity patterns to physical memory blocks, avoiding the need to materialize full dense tensors.
  5. Implementation – Built on top of PyTorch/CUDA, with custom kernels for the sparse attention and a lightweight runtime that orchestrates the phase multiplexing.
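
To make step 2 concrete, here is a minimal PyTorch sketch of the chunking idea: instead of materializing the full [num_tokens × vocab] logit tensor, logits are produced in slabs sized to a memory budget and freed immediately. The function name, the budget_bytes knob, and the argmax‑only readout are assumptions for illustration, not the paper's kernels.

```python
import torch

@torch.no_grad()
def budgeted_argmax(hidden: torch.Tensor,
                    lm_head: torch.Tensor,
                    budget_bytes: int = 256 << 20) -> torch.Tensor:
    """Argmax over the vocabulary without materializing the full
    [num_tokens, vocab] logit tensor.

    hidden:       [num_tokens, d_model] hidden states for this step
    lm_head:      [vocab, d_model] output projection weights
    budget_bytes: rough cap on the transient logit slab (assumed knob)
    """
    num_tokens, _ = hidden.shape
    vocab = lm_head.shape[0]
    bytes_per_row = vocab * hidden.element_size()
    # Largest token chunk whose logit slab stays under the budget.
    chunk = max(1, min(num_tokens, budget_bytes // bytes_per_row))

    out = torch.empty(num_tokens, dtype=torch.long, device=hidden.device)
    for start in range(0, num_tokens, chunk):
        end = min(start + chunk, num_tokens)
        logits = hidden[start:end] @ lm_head.T   # [chunk, vocab], short-lived
        out[start:end] = logits.argmax(dim=-1)
        del logits                               # free the slab before the next chunk
    return out

# Example: 2048 tokens, 32k vocab -> the full fp32 logit tensor would be ~250 MiB,
# but with a 32 MiB budget each slab stays small and is released right away.
hidden = torch.randn(2048, 512)
lm_head = torch.randn(32_000, 512)
next_tokens = budgeted_argmax(hidden, lm_head, budget_bytes=32 << 20)
```

Freeing each slab before the next matmul keeps the peak transient footprint near the budget rather than near the full logit size; the paper's budgeting additionally predicts per‑head activation peaks and partitions memory dynamically, which this sketch does not attempt.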

Results & Findings

GPU        Workload    Throughput speed‑up vs. baseline    Tail latency (95th pct)
RTX 4090   LiveBench   1.81×                               ↓ ≈ 4×
RTX 4090   Burst       1.73×                               ↓ ≈ 3.8×
L40S       OSC         1.60×                               ↓ ≈ 4×
  • Memory usage dropped by ~30 % on average thanks to activation budgeting.
  • GPU utilization stayed above 85 % across all phases, whereas the baseline oscillated between 30 % (Refresh) and 70 % (Reuse).
  • Generation quality (BLEU / ROUGE) remained statistically indistinguishable from the baseline, confirming that the sparsity tricks did not degrade model output.

Practical Implications

  • Cost‑effective scaling – Developers can run dLLMs on cheaper consumer GPUs (RTX 4090) with server‑grade performance, reducing cloud spend.
  • Higher concurrency – Phase multiplexing lets a single GPU serve many more simultaneous chat or completion requests without hitting OOM errors.
  • Simplified deployment – The memory‑budgeting logic abstracts away low‑level tensor management, making it easier to integrate dLLMs into existing inference stacks (e.g., Triton, vLLM).
  • Real‑time applications – The dramatic tail‑latency reduction opens the door for latency‑sensitive use‑cases such as interactive coding assistants or live translation.

Limitations & Future Work

  • Hardware specificity – Optimizations are tuned for NVIDIA GPUs; porting to AMD or specialized AI accelerators will require additional kernel work.
  • Model‑agnosticity – The system assumes a diffusion‑based generation schedule; adapting it to hybrid models (e.g., diffusion + autoregressive finetuning) is non‑trivial.
  • Dynamic workloads – While the scheduler handles static phase patterns well, highly irregular request patterns (e.g., variable token lengths) may still cause sub‑optimal packing.
  • Future directions – Extending head‑centric sparsity to multi‑GPU sharding, automating activation budgeting via reinforcement learning, and exploring compiler‑level support for diffusion‑specific kernels.

Authors

  • Jiakun Fan
  • Yanglin Zhang
  • Xiangchen Li
  • Dimitrios S. Nikolopoulos

Paper Information

  • arXiv ID: 2512.17077v1
  • Categories: cs.DC
  • Published: December 18, 2025