[Paper] Taming the Memory Footprint Crisis: System Design for Production Diffusion LLM Serving

Published: December 18, 2025 at 04:18 PM EST
3 min read
Source: arXiv - 2512.17077v1

Overview

Diffusion Large Language Models (dLLMs) promise faster, parallel text generation compared with traditional autoregressive models, but they hit a “memory footprint crisis” when deployed at scale. The paper introduces dLLM‑Serve, a production‑ready serving system that tames the memory spikes and uneven compute‑bandwidth demands of diffusion inference, delivering higher throughput and lower tail latency on both consumer‑grade and server‑grade GPUs.

Key Contributions

  • Memory‑aware tensor decomposition – Logit‑Aware Activation Budgeting breaks down massive, short‑lived logit tensors into smaller pieces that fit comfortably in GPU memory.
  • Phase‑aware scheduling – The Phase‑Multiplexed Scheduler interleaves the compute‑heavy “Refresh” phase with the bandwidth‑bound “Reuse” phase across multiple requests, smoothing resource usage (see the sketch after this list).
  • Sparse attention redesign – Head‑Centric Sparse Attention separates logical sparsity (which heads attend to which tokens) from physical memory layout, enabling efficient storage and retrieval.
  • End‑to‑end system prototype – All three techniques are integrated into a unified serving stack, dLLM‑Serve, with the code released for reproducibility.
  • Comprehensive evaluation – Experiments show 1.6×–1.8× throughput gains and up to 4× lower tail latency on RTX 4090 and NVIDIA L40S GPUs across realistic workloads (LiveBench, Burst, OSC).
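
A rough sketch of the phase‑multiplexing idea from the second bullet above: keep one queue per phase and draw batches alternately so compute‑bound and bandwidth‑bound work overlap across requests. This is not the paper's scheduler – the Request and Phase types, the batch sizes, and the backlog‑based selection rule are illustrative assumptions.

```python
from collections import deque
from dataclasses import dataclass
from enum import Enum

class Phase(Enum):
    REFRESH = "refresh"   # compute-bound: recompute the diffusion state
    REUSE = "reuse"       # bandwidth-bound: reuse cached activations

@dataclass
class Request:
    req_id: int
    phase: Phase

class PhaseMultiplexedScheduler:
    """Toy scheduler: one queue per phase, batches drawn alternately so that
    neither the compute units nor the memory bus sit idle for long."""

    def __init__(self, refresh_batch: int = 4, reuse_batch: int = 16):
        self.queues = {Phase.REFRESH: deque(), Phase.REUSE: deque()}
        self.batch_size = {Phase.REFRESH: refresh_batch, Phase.REUSE: reuse_batch}

    def submit(self, req: Request) -> None:
        self.queues[req.phase].append(req)

    def next_batch(self) -> tuple[Phase, list[Request]]:
        # Pick the phase with the largest backlog relative to its batch size.
        phase = max(self.queues, key=lambda p: len(self.queues[p]) / self.batch_size[p])
        queue, size = self.queues[phase], self.batch_size[phase]
        return phase, [queue.popleft() for _ in range(min(size, len(queue)))]

if __name__ == "__main__":
    sched = PhaseMultiplexedScheduler()
    for i in range(6):
        sched.submit(Request(i, Phase.REFRESH if i % 2 else Phase.REUSE))
    print(sched.next_batch())   # e.g. (Phase.REFRESH, [requests 1, 3, 5])
```

The real system presumably also folds memory‑budget and fairness constraints into this decision; the sketch only shows the per‑phase batching that lets the two resource profiles overlap.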

Methodology

  1. Profiling the diffusion pipeline – The authors instrumented a reference dLLM implementation to expose two distinct phases:
    • Refresh: recompute the diffusion state (compute‑bound).
    • Reuse: reuse previously computed activations to generate the next token (bandwidth‑bound).
  2. Logit‑Aware Activation Budgeting – Instead of allocating a single monolithic buffer for the entire logit tensor, the system predicts the peak activation size per head and dynamically partitions memory, releasing buffers as soon as a phase finishes (a minimal sketch follows this list).
  3. Phase‑Multiplexed Scheduler – Requests are queued per phase. The scheduler packs together multiple “Refresh” tasks followed by a batch of “Reuse” tasks, ensuring the GPU’s compute units stay busy while the memory bus is not saturated.
  4. Head‑Centric Sparse Attention – The attention matrix is stored per head with a compact index that maps logical sparsity patterns to physical memory blocks, avoiding the need to materialize full dense tensors.
  5. Implementation – Built on top of PyTorch/CUDA, with custom kernels for the sparse attention and a lightweight runtime that orchestrates the phase multiplexing.
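
To make step 2 concrete, here is a minimal PyTorch sketch of the chunking idea: instead of materializing the full [num_tokens × vocab] logit tensor, logits are produced in slabs sized to a memory budget and freed immediately. The function name, the budget_bytes knob, and the argmax‑only readout are assumptions for illustration, not the paper's kernels.

```python
import torch

@torch.no_grad()
def budgeted_argmax(hidden: torch.Tensor,
                    lm_head: torch.Tensor,
                    budget_bytes: int = 256 << 20) -> torch.Tensor:
    """Argmax over the vocabulary without materializing the full
    [num_tokens, vocab] logit tensor.

    hidden:       [num_tokens, d_model] hidden states for this step
    lm_head:      [vocab, d_model] output projection weights
    budget_bytes: rough cap on the transient logit slab (assumed knob)
    """
    num_tokens, _ = hidden.shape
    vocab = lm_head.shape[0]
    bytes_per_row = vocab * hidden.element_size()
    # Largest token chunk whose logit slab stays under the budget.
    chunk = max(1, min(num_tokens, budget_bytes // bytes_per_row))

    out = torch.empty(num_tokens, dtype=torch.long, device=hidden.device)
    for start in range(0, num_tokens, chunk):
        end = min(start + chunk, num_tokens)
        logits = hidden[start:end] @ lm_head.T   # [chunk, vocab], short-lived
        out[start:end] = logits.argmax(dim=-1)
        del logits                               # free the slab before the next chunk
    return out

# Example: 2048 tokens, 32k vocab -> the full fp32 logit tensor would be ~250 MiB,
# but with a 32 MiB budget each slab stays small and is released right away.
hidden = torch.randn(2048, 512)
lm_head = torch.randn(32_000, 512)
next_tokens = budgeted_argmax(hidden, lm_head, budget_bytes=32 << 20)
```

Freeing each slab before the next matmul keeps the peak transient footprint near the budget rather than near the full logit size; the paper's budgeting additionally predicts per‑head activation peaks and partitions memory dynamically, which this sketch does not attempt.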

Results & Findings

GPU        Workload    Throughput speed‑up vs. baseline    Tail latency (95th pct)
RTX 4090   LiveBench   1.81×                               ↓ ≈ 4×
RTX 4090   Burst       1.73×                               ↓ ≈ 3.8×
L40S       OSC         1.60×                               ↓ ≈ 4×
  • Memory usage dropped by ~30 % on average thanks to activation budgeting.
  • GPU utilization stayed above 85 % across all phases, whereas the baseline oscillated between 30 % (Refresh) and 70 % (Reuse).
  • Generation quality (BLEU / ROUGE) remained statistically indistinguishable from the baseline, confirming that the sparsity tricks did not degrade model output.

Practical Implications

  • Cost‑effective scaling – Developers can run dLLMs on cheaper consumer GPUs (RTX 4090) with server‑grade performance, reducing cloud spend.
  • Higher concurrency – Phase multiplexing lets a single GPU serve many more simultaneous chat or completion requests without hitting OOM errors.
  • Simplified deployment – The memory‑budgeting logic abstracts away low‑level tensor management, making it easier to integrate dLLMs into existing inference stacks (e.g., Triton, vLLM).
  • Real‑time applications – The dramatic tail‑latency reduction opens the door for latency‑sensitive use‑cases such as interactive coding assistants or live translation.

Limitations & Future Work

  • Hardware specificity – Optimizations are tuned for NVIDIA GPUs; porting to AMD or specialized AI accelerators will require additional kernel work.
  • Model‑agnosticity – The system assumes a diffusion‑based generation schedule; adapting it to hybrid models (e.g., diffusion + autoregressive finetuning) is non‑trivial.
  • Dynamic workloads – While the scheduler handles static phase patterns well, highly irregular request patterns (e.g., variable token lengths) may still cause sub‑optimal packing.
  • Future directions – Extending head‑centric sparsity to multi‑GPU sharding, automating activation budgeting via reinforcement learning, and exploring compiler‑level support for diffusion‑specific kernels.

Authors

  • Jiakun Fan
  • Yanglin Zhang
  • Xiangchen Li
  • Dimitrios S. Nikolopoulos

Paper Information

  • arXiv ID: 2512.17077v1
  • Categories: cs.DC
  • Published: December 18, 2025