[Paper] FOCUS: DLLMs Know How to Tame Their Compute Bound
Source: arXiv - 2601.23278v1
Overview
Diffusion Large Language Models (DLLMs) promise higher-quality text generation than classic auto‑regressive LLMs, but their inference cost has kept them out of production pipelines. This paper uncovers a fundamental inefficiency in DLLM decoding and introduces FOCUS, a runtime system that dynamically concentrates compute on the tokens that actually need to be decoded, delivering up to 3.5× higher throughput without sacrificing output quality.
Key Contributions
- Identify the bottleneck: Show that, during each diffusion step, only a tiny fraction of tokens are decodable while the rest still consume GPU cycles.
- Correlation insight: Demonstrate a strong link between attention‑derived token importance scores and the probability that a token will be decoded at the next step.
- FOCUS inference engine: Design a dynamic scheduling algorithm that focuses GPU resources on decodable tokens and evicts the rest on‑the‑fly, effectively increasing the usable batch size.
- Open‑source implementation: Release a production‑ready library (compatible with LMDeploy) that can be dropped into existing DLLM serving stacks.
- Empirical validation: Achieve up to 3.52× throughput gains on standard benchmarks (e.g., WikiText, CommonGen) while maintaining or improving generation quality (BLEU, ROUGE, and human eval scores).
Methodology
- Profiling DLLM decoding: The authors instrumented a state‑of‑the‑art diffusion LLM to measure per‑token compute across diffusion steps. They observed that most GPU kernels processed tokens that were not yet ready for sampling.
- Attention‑based importance metric: By extracting the attention weights from the model’s internal layers, they derived a lightweight “importance score” for each token. Tokens with higher scores were far more likely to become decodable in the next diffusion iteration.
- Dynamic token selection: FOCUS maintains a priority queue of tokens sorted by importance. At each step it:
  - Selects the smallest set of top-ranked tokens whose cumulative decoding probability exceeds a configurable threshold.
  - Executes the diffusion kernels only on this subset.
  - Re‑injects evicted tokens back into the queue once they become eligible.
- Batch‑size scaling: Because the active token set is much smaller, the same GPU can process more effective batches in parallel, boosting overall throughput.
- Integration with LMDeploy: The system wraps the existing inference engine, requiring only a few API changes, which simplifies adoption for existing services.
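The importance metric and the selection loop above can be sketched together. This is a minimal illustration, not the paper's exact formulation: the head-averaged column-sum pooling and the function names are assumptions made for the example.

```python
import heapq
import numpy as np

def importance_scores(attn: np.ndarray) -> np.ndarray:
    """Toy attention-derived importance: average the attention maps over
    heads, then take the attention mass each token position *receives*
    (column sums). `attn` has shape (heads, seq, seq), each row a
    distribution. Returns a normalized per-token score vector."""
    received = attn.mean(axis=0).sum(axis=0)   # shape: (seq,)
    return received / received.sum()

def select_active_tokens(scores, decode_probs, threshold):
    """Greedily pick tokens in descending importance until their cumulative
    decoding probability exceeds `threshold`. Remaining tokens are evicted
    for this diffusion step and re-enter the queue later."""
    heap = [(-s, i) for i, s in enumerate(scores)]  # max-heap via negation
    heapq.heapify(heap)
    active, cum = [], 0.0
    while heap and cum < threshold:
        _, i = heapq.heappop(heap)
        active.append(i)
        cum += decode_probs[i]
    evicted = [i for _, i in heap]  # re-injected on a later step
    return active, evicted
```

Because each step then touches only the active subset rather than the full context, the same per-step token budget can pack proportionally more sequences, which is where the effective batch-size gain described above comes from.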
Results & Findings
| Metric | Baseline (LMDeploy) | FOCUS | Speed‑up | Quality Δ |
|---|---|---|---|---|
| Tokens/sec (WikiText) | 1,200 | 4,200 | 3.5× | ≈ 0% (BLEU) |
| Tokens/sec (CommonGen) | 950 | 3,300 | 3.5× | +0.3 BLEU |
| GPU Utilization | 68 % | 92 % | — | — |
| Latency (p90) | 210 ms | 78 ms | — | — |
- Throughput: Across five diverse generation tasks, FOCUS consistently delivered 2.8–3.5× higher token‑per‑second rates.
- Quality: No statistically significant drop in standard automatic metrics; in two cases, quality even improved, likely because the model spent more compute on “hard” tokens.
- Scalability: The system scales linearly with the number of GPUs, confirming that the dynamic focus does not introduce synchronization bottlenecks.
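As a quick sanity check, the speed-ups implied by the throughput rows of the table can be recomputed directly:

```python
# Tokens/sec from the results table above.
baseline = {"WikiText": 1200, "CommonGen": 950}
focus = {"WikiText": 4200, "CommonGen": 3300}

# WikiText works out to exactly 3.5x; CommonGen to ~3.47x, consistent
# with the reported 2.8-3.5x range across tasks.
speedup = {task: focus[task] / baseline[task] for task in baseline}
```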
Practical Implications
- Cost‑effective serving: Cloud providers can run DLLMs at a fraction of the current compute budget, making diffusion‑based generation viable for chatbots, code assistants, and content creation services.
- Higher request concurrency: By increasing effective batch size, APIs can handle more simultaneous users without adding hardware, reducing latency spikes during traffic bursts.
- Energy savings: Focusing compute reduces wasted GPU cycles, aligning with sustainability goals for large‑scale AI deployments.
- Plug‑and‑play adoption: Since FOCUS is built as a thin wrapper around LMDeploy, teams can integrate it with minimal code changes, preserving existing model checkpoints and pipelines.
- Enabling new use‑cases: Faster DLLM inference opens the door for real‑time applications (e.g., interactive storytelling, on‑device generation) that previously had to fall back on auto‑regressive models to meet latency targets.
Limitations & Future Work
- Model‑specific tuning: The importance‑based selection threshold is currently a hyper‑parameter that may need per‑model calibration; a universal setting is not yet proven.
- Memory overhead: Maintaining priority queues and token metadata adds a modest memory footprint, which could become a bottleneck on memory‑constrained edge devices.
- Generality to other diffusion architectures: The study focuses on a specific class of DLLMs; extending FOCUS to newer diffusion variants (e.g., latent diffusion for text) remains an open question.
- Adaptive scheduling research: Future work could explore reinforcement‑learning‑based token selection to further reduce latency and improve quality.
FOCUS demonstrates that smart runtime engineering can bridge the gap between cutting‑edge research models and real‑world production constraints, turning diffusion LLMs from a curiosity into a practical tool for developers.
Authors
- Kaihua Liang
- Xin Tan
- An Zhong
- Hong Xu
- Marco Canini
Paper Information
- arXiv ID: 2601.23278v1
- Categories: cs.LG, cs.AR, cs.CL
- Published: January 30, 2026