[Paper] Taming the Memory Footprint Crisis: System Design for Production Diffusion LLM Serving
Source: arXiv - 2512.17077v1
Overview
Diffusion Large Language Models (dLLMs) promise faster, parallel text generation compared with traditional autoregressive models, but they hit a “memory footprint crisis” when deployed at scale. The paper introduces dLLM‑Serve, a production‑ready serving system that tames the memory spikes and uneven compute‑bandwidth demands of diffusion inference, delivering higher throughput and lower tail latency on both consumer‑grade and server‑grade GPUs.
Key Contributions
- Memory‑aware tensor decomposition – Logit‑Aware Activation Budgeting breaks down massive, short‑lived logit tensors into smaller pieces that fit comfortably in GPU memory.
- Phase‑aware scheduling – The Phase‑Multiplexed Scheduler interleaves the compute‑heavy “Refresh” phase with the bandwidth‑bound “Reuse” phase across multiple requests, smoothing resource usage.
- Sparse attention redesign – Head‑Centric Sparse Attention separates logical sparsity (which heads attend to which tokens) from physical memory layout, enabling efficient storage and retrieval.
- End‑to‑end system prototype – All three techniques are integrated into a unified serving stack (dLLM‑Serve), with code released for reproducibility.
- Comprehensive evaluation – Demonstrates 1.6×–1.8× throughput gains and up to ≈4× lower tail latency on RTX 4090 and NVIDIA L40S GPUs across realistic workloads (LiveBench, Burst, OSC).
Methodology
- Profiling the diffusion pipeline – The authors instrumented a reference dLLM implementation to expose two distinct phases:
  - Refresh: recompute the diffusion state (compute‑bound).
  - Reuse: reuse previously computed activations to generate the next token (bandwidth‑bound).
- Logit‑Aware Activation Budgeting – Instead of allocating a single monolithic buffer for the entire logit tensor, the system predicts the peak activation size per head and dynamically partitions memory, releasing buffers as soon as a phase finishes (see the chunked‑logit sketch after this list).
- Phase‑Multiplexed Scheduler – Requests are queued per phase. The scheduler packs together multiple “Refresh” tasks followed by a batch of “Reuse” tasks, ensuring the GPU’s compute units stay busy while the memory bus is not saturated (see the scheduler sketch after this list).
- Head‑Centric Sparse Attention – The attention matrix is stored per head with a compact index that maps logical sparsity patterns to physical memory blocks, avoiding the need to materialize full dense tensors (see the block‑index sketch after this list).
- Implementation – Built on top of PyTorch/CUDA, with custom kernels for the sparse attention and a lightweight runtime that orchestrates the phase multiplexing.
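A minimal sketch of the chunking idea behind Logit‑Aware Activation Budgeting, assuming a standard `lm_head` linear projection and reducing the per‑token decision to an argmax (a real diffusion serving loop would also need per‑token confidences for the unmasking schedule). The function name and the `budget_bytes` knob are illustrative, not the paper’s API.

```python
import torch

def budgeted_logits_argmax(hidden: torch.Tensor,
                           lm_head: torch.nn.Linear,
                           budget_bytes: int = 256 << 20) -> torch.Tensor:
    """Per-token argmax over the vocabulary without materializing the full
    [num_tokens, vocab_size] logit tensor at once.

    budget_bytes caps the size of each short-lived logit chunk; it is an
    illustrative knob, not the paper's exact budgeting policy.
    """
    num_tokens, _ = hidden.shape
    vocab_size = lm_head.out_features
    bytes_per_row = vocab_size * hidden.element_size()
    chunk_rows = max(1, budget_bytes // bytes_per_row)

    out = torch.empty(num_tokens, dtype=torch.long, device=hidden.device)
    for start in range(0, num_tokens, chunk_rows):
        end = min(start + chunk_rows, num_tokens)
        # Only a chunk_rows x vocab_size slice of logits is alive at any time.
        logits = lm_head(hidden[start:end])
        out[start:end] = logits.argmax(dim=-1)
        del logits  # release the chunk before computing the next one
    return out
```

The key point is that the peak activation footprint is bounded by the budget rather than by the number of in-flight tokens times the vocabulary size.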
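A toy illustration of the phase-multiplexing idea: one queue per phase, with batches drawn so compute-bound Refresh work and bandwidth-bound Reuse work alternate across requests. The class name, batch sizes, and admission policy are assumptions for illustration, not the paper’s scheduler.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    req_id: int
    phase: str = "refresh"   # "refresh" (compute-bound) or "reuse" (bandwidth-bound)
    steps_left: int = 8      # remaining diffusion steps (illustrative)

class PhaseMultiplexedScheduler:
    """Toy two-queue scheduler: batch Refresh work, then batch Reuse work,
    so compute- and bandwidth-bound phases overlap across requests."""

    def __init__(self, refresh_batch: int = 4, reuse_batch: int = 16):
        self.refresh_q: deque = deque()
        self.reuse_q: deque = deque()
        self.refresh_batch = refresh_batch
        self.reuse_batch = reuse_batch

    def admit(self, req: Request) -> None:
        (self.refresh_q if req.phase == "refresh" else self.reuse_q).append(req)

    def next_batch(self) -> tuple:
        # Prefer compute-bound work when it can fill a batch; otherwise drain
        # bandwidth-bound work so the memory bus stays busy.
        if len(self.refresh_q) >= self.refresh_batch or not self.reuse_q:
            n = min(self.refresh_batch, len(self.refresh_q))
            return "refresh", [self.refresh_q.popleft() for _ in range(n)]
        n = min(self.reuse_batch, len(self.reuse_q))
        return "reuse", [self.reuse_q.popleft() for _ in range(n)]
```

A serving loop would call `next_batch()` repeatedly, run the returned batch through the corresponding kernel path, and re-admit any request whose phase changed.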
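A minimal sketch of separating logical per-head sparsity from physical storage, in the spirit of Head-Centric Sparse Attention: each head keeps a small map from logical KV-block ids to slots in a shared physical pool, so only the blocks a head actually attends to are ever gathered. The block-pool layout and class name are illustrative assumptions, not the paper’s kernel-level design.

```python
import torch

class HeadCentricKVIndex:
    """Per-head block index: logical sparsity (which KV blocks a head keeps)
    lives in a compact map, separate from the shared physical block pool,
    so dense per-head tensors are never materialized."""

    def __init__(self, num_heads: int, num_blocks: int, block_size: int,
                 head_dim: int, device: str = "cpu"):
        # One shared physical pool of KV blocks for all heads.
        self.pool = torch.zeros(num_blocks, block_size, head_dim, device=device)
        self.free = list(range(num_blocks))
        # Per-head mapping: logical block id -> physical slot in the pool.
        self.index = [dict() for _ in range(num_heads)]

    def write_block(self, head: int, logical_id: int, block: torch.Tensor) -> None:
        slot = self.index[head].get(logical_id)
        if slot is None:
            slot = self.free.pop()               # claim a physical slot
            self.index[head][logical_id] = slot
        self.pool[slot].copy_(block)

    def gather(self, head: int) -> torch.Tensor:
        # Materialize only the blocks this head actually attends to,
        # ordered by logical block id: [kept_blocks, block_size, head_dim].
        slots = sorted(self.index[head].items())
        return self.pool[[s for _, s in slots]]
```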
Results & Findings
| GPU | Workload | Throughput vs. baseline | Tail latency (p95) vs. baseline |
|---|---|---|---|
| RTX 4090 | LiveBench | 1.81× | ↓ ≈ 4× |
| RTX 4090 | Burst | 1.73× | ↓ ≈ 3.8× |
| L40S | OSC | 1.60× | ↓ ≈ 4× |
- Memory usage dropped by ~30 % on average thanks to activation budgeting.
- GPU utilization stayed above 85 % across all phases, whereas the baseline oscillated between 30 % (Refresh) and 70 % (Reuse).
- Generation quality (BLEU / ROUGE) remained statistically indistinguishable from the baseline, confirming that the sparse‑attention and budgeting optimizations did not degrade model output.
Practical Implications
- Cost‑effective scaling – Developers can run dLLMs on cheaper consumer GPUs (RTX 4090) with server‑grade performance, reducing cloud spend.
- Higher concurrency – Phase multiplexing lets a single GPU serve many more simultaneous chat or completion requests without hitting OOM errors.
- Simplified deployment – The memory‑budgeting logic abstracts away low‑level tensor management, making it easier to integrate dLLMs into existing inference stacks (e.g., Triton, vLLM).
- Real‑time applications – The dramatic tail‑latency reduction opens the door for latency‑sensitive use‑cases such as interactive coding assistants or live translation.
Limitations & Future Work
- Hardware specificity – Optimizations are tuned for NVIDIA GPUs; porting to AMD or specialized AI accelerators will require additional kernel work.
- Limited model generality – The system assumes a diffusion‑based generation schedule; adapting it to hybrid models (e.g., diffusion + autoregressive fine‑tuning) is non‑trivial.
- Dynamic workloads – While the scheduler handles static phase patterns well, highly irregular request patterns (e.g., variable token lengths) may still cause sub‑optimal packing.
- Future directions – Extending head‑centric sparsity to multi‑GPU sharding, automating activation budgeting via reinforcement learning, and exploring compiler‑level support for diffusion‑specific kernels.
Authors
- Jiakun Fan
- Yanglin Zhang
- Xiangchen Li
- Dimitrios S. Nikolopoulos
Paper Information
- arXiv ID: 2512.17077v1
- Categories: cs.DC
- Published: December 18, 2025