[Paper] DiffBench Meets DiffAgent: End-to-End LLM-Driven Diffusion Acceleration Code Generation
Source: arXiv - 2601.03178v1
Overview
Diffusion models are the backbone of today’s high‑fidelity image and video generators, but their multi‑step inference pipelines make them painfully slow for production use. The paper DiffBench Meets DiffAgent tackles this bottleneck by marrying two trends:
- a systematic benchmark (DiffBench) that measures how well different acceleration tricks work together, and
- an LLM‑powered “agent” (DiffAgent) that automatically writes, tests, and refines code to speed up any diffusion model.
The result is a reproducible, end‑to‑end pipeline that can turn a vanilla diffusion model into a production‑ready, low‑latency service with minimal human effort.
Key Contributions
- DiffBench: A unified benchmark covering a wide range of diffusion architectures (e.g., UNet, Transformer‑based), hardware back‑ends (GPU, CPU, edge accelerators), and acceleration techniques (pruning, quantization, knowledge distillation, scheduler tweaks). It provides a three‑stage automated evaluation pipeline:
- code generation,
- functional correctness testing, and
- performance profiling.
- DiffAgent: An LLM‑driven autonomous agent that iteratively proposes acceleration strategies, generates the corresponding Python/C++ code, runs it, and uses a genetic‑algorithm‑style feedback loop to evolve better solutions. The agent consists of four components (a minimal interface sketch follows this contribution list):
- Planner – selects promising technique combinations based on model metadata.
- Code Generator – prompts a large language model (e.g., GPT‑4) to emit implementation snippets.
- Debugger – parses runtime errors and feeds them back to the planner.
- Genetic Optimizer – treats each generated script as an individual, mutates/recombines them, and selects the highest‑throughput candidates.
- Closed‑Loop Evaluation: The entire workflow runs without manual intervention, enabling rapid prototyping of acceleration pipelines for new diffusion models.
- Empirical Validation: Across 12 diffusion models and 7 hardware setups, DiffAgent consistently outperforms baseline LLM prompts and hand‑crafted acceleration scripts, achieving up to 3.2× speed‑up with < 1 % quality degradation.
Methodology
1. Benchmark Construction (DiffBench)
- Curated a dataset of 12 open‑source diffusion models spanning text‑to‑image, video, and super‑resolution tasks.
- Implemented wrappers for 9 popular acceleration primitives (e.g., TensorRT INT8, ONNX Runtime, weight pruning).
- Defined three evaluation stages:
- Correctness: Verify that the accelerated model produces outputs within a preset PSNR/LPIPS tolerance.
- Performance: Measure latency, throughput, and memory footprint on each target device.
- Robustness: Run a stress test with varied batch sizes and random seeds.
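As a rough illustration of the correctness and performance stages, the sketch below compares an accelerated model against the reference within a PSNR tolerance and measures median latency. The threshold, helper names, and the assumption that outputs are normalized to [0, 1] are mine; the paper uses PSNR/LPIPS tolerances but the exact values are not fixed here.

```python
# Illustrative correctness and performance checks in the spirit of DiffBench.
# The PSNR threshold, warm-up counts, and the [0, 1] output range are assumptions.
import math
import time

import torch


def psnr(ref: torch.Tensor, out: torch.Tensor) -> float:
    """Peak signal-to-noise ratio, assuming pixel values in [0, 1]."""
    mse = torch.mean((ref - out) ** 2).item()
    return float("inf") if mse == 0 else -10.0 * math.log10(mse)


def check_correctness(ref_model, fast_model, latents, psnr_min: float = 35.0) -> bool:
    """Correctness stage: accelerated outputs must stay within a preset tolerance."""
    with torch.no_grad():
        ref, out = ref_model(latents), fast_model(latents)
    return psnr(ref, out) >= psnr_min


def profile_latency_ms(model, latents, warmup: int = 3, iters: int = 10) -> float:
    """Performance stage: median wall-clock latency in milliseconds on the target device."""
    with torch.no_grad():
        for _ in range(warmup):
            model(latents)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        times = []
        for _ in range(iters):
            start = time.perf_counter()
            model(latents)
            if torch.cuda.is_available():
                torch.cuda.synchronize()  # finish GPU work before stopping the clock
            times.append((time.perf_counter() - start) * 1000.0)
    return sorted(times)[len(times) // 2]
```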
2. Agent Design (DiffAgent)
- Planning: The agent extracts model characteristics (layer types, parameter counts) and consults a knowledge base of technique compatibilities.
- Code Generation: It crafts a prompt that includes the model’s API, desired speed‑up target, and hardware constraints, then feeds this to an LLM. The LLM returns a self‑contained script (often a mix of PyTorch, TorchScript, and custom CUDA kernels).
- Debugging & Feedback: Execution logs are parsed for errors (e.g., missing operators, shape mismatches). The debugger rewrites the prompt with corrective hints.
- Genetic Optimization: Each script is treated as a genome; mutation operators randomly toggle techniques (e.g., switch from FP16 to INT8). A fitness function combines latency gain and quality loss. Over several generations, the agent converges on a high‑performing solution.
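The sketch below illustrates that genetic loop in miniature: genomes are encoded here as sets of enabled techniques (one way to realize "each script as an individual"), mutation toggles techniques such as FP16 versus INT8, and fitness rewards latency gain while penalizing quality loss. The technique list, mutation rate, and weight are illustrative assumptions, not values from the paper.

```python
# Toy version of the genetic feedback loop: a genome is the set of enabled
# techniques, mutation toggles techniques, and fitness trades latency gain
# against quality loss. Technique names, rates, and weights are assumptions.
import random

TECHNIQUES = ["fp16", "int8", "operator_fusion", "weight_pruning", "scheduler_tweak"]


def fitness(latency_gain: float, quality_loss: float, alpha: float = 5.0) -> float:
    """Reward speed-up, penalize quality degradation (e.g., an LPIPS increase)."""
    return latency_gain - alpha * quality_loss


def mutate(genome: set, rate: float = 0.3) -> set:
    """Randomly toggle techniques, e.g. switching from FP16 to INT8 quantization."""
    child = set(genome)
    for tech in TECHNIQUES:
        if random.random() < rate:
            child.symmetric_difference_update({tech})
    if {"fp16", "int8"} <= child:          # keep at most one precision mode
        child.discard(random.choice(["fp16", "int8"]))
    return child


def crossover(a: set, b: set) -> set:
    """Uniform crossover: each technique is inherited from either parent."""
    return {t for t in TECHNIQUES if t in random.choice([a, b])}


def evolve(population: list, scores: list, keep: int = 4) -> list:
    """Keep the highest-fitness genomes, then refill the population with offspring."""
    ranked = [g for g, _ in sorted(zip(population, scores), key=lambda pair: -pair[1])]
    survivors = ranked[:keep]
    children = [mutate(crossover(*random.sample(survivors, 2)))
                for _ in range(len(population) - keep)]
    return survivors + children
```

In the paper's evaluation, candidates combining operator fusion, mixed precision, and pruning tended to dominate after several generations (see Results & Findings below).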
3. Evaluation Loop
- The generated code is automatically compiled, loaded, and benchmarked via DiffBench.
- Results are fed back to the genetic optimizer, which decides whether to keep, discard, or mutate the candidate.
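Putting the pieces together, one closed-loop iteration might look like the following, assuming a DiffBench-style harness that returns a speed-up, quality delta, and any runtime error for a candidate script. The `bench`/`result` attribute names and the single debug retry are hypothetical.

```python
# One closed-loop iteration, assuming a DiffBench-style harness object; the
# bench/result attribute names and the single debug retry are assumptions.
def closed_loop_step(population, codegen, debugger, optimizer, bench, model_meta):
    scores = []
    for techniques in population:
        script = codegen.generate(model_meta, techniques)
        result = bench.evaluate(script)                  # compile, load, and benchmark
        if result.error:                                 # debugging feedback: retry once with hints
            script = codegen.generate(model_meta, techniques,
                                      hints=debugger.diagnose(result.error))
            result = bench.evaluate(script)
        # Same trade-off as the fitness sketch above: speed-up minus weighted quality loss.
        score = (float("-inf") if result.error
                 else result.speedup - 5.0 * max(result.quality_delta, 0.0))
        scores.append(score)
    return optimizer.evolve(population, scores)          # keep, discard, or mutate candidates
```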
Results & Findings
| Model (Task) | Baseline Latency (ms) | DiffAgent Latency (ms) | Speed‑up | Quality Δ (LPIPS) |
|---|---|---|---|---|
| StableDiffusion‑v1.5 (text‑to‑image) | 1200 | 380 | 3.2× | +0.006 |
| VideoDiffusion‑2 (16‑frame video) | 5400 | 1700 | 3.2× | +0.009 |
| Real‑ESRGAN (super‑resolution) | 850 | 280 | 3.0× | +0.004 |
- Higher‑order combos win: The best scripts combined operator fusion + mixed‑precision + kernel‑level pruning.
- Genetic feedback matters: Pure LLM prompting without the evolutionary loop plateaued at ~1.5× speed‑up.
- Hardware‑aware tuning: On edge GPUs (e.g., Jetson Nano), the agent learned to favor INT8 quantization and aggressive kernel tiling, achieving a 2.4× gain while staying within the device’s memory budget.
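One plausible way such a memory budget enters the search is as a hard constraint in the fitness function; the snippet below is an illustrative guess, not the paper's actual mechanism.

```python
# Illustrative guess at how a device memory budget could act as a hard
# constraint during selection; the paper's actual mechanism is not described here.
def hardware_aware_fitness(speedup: float, quality_loss: float,
                           peak_mem_mb: float, mem_budget_mb: float,
                           alpha: float = 5.0) -> float:
    """Reject candidates that exceed the target device's memory budget (e.g., an edge GPU)."""
    if peak_mem_mb > mem_budget_mb:
        return float("-inf")   # infeasible on this device, e.g. a Jetson-class board
    return speedup - alpha * quality_loss
```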
Practical Implications
- Rapid Deployment: Teams can feed a new diffusion checkpoint into DiffAgent and obtain a production‑ready, optimized inference script in under an hour—dramatically shortening the “research‑to‑product” cycle.
- Cost Savings: Faster inference translates directly to lower cloud GPU bills. A 3× speed‑up on a typical Stable Diffusion service can cut monthly compute spend by ~30 %.
- Edge AI Enablement: The framework’s hardware‑aware component makes it feasible to run diffusion models on edge devices (mobile, AR/VR headsets) that previously could only host lightweight classifiers.
- Standardized Evaluation: DiffBench can serve as a community reference for comparing new acceleration libraries (e.g., NVIDIA’s FasterTransformer, Intel’s OpenVINO) under identical conditions.
Limitations & Future Work
- LLM Dependency: The quality of generated code hinges on the underlying LLM; older or smaller models may produce non‑compilable scripts, increasing the debugging burden.
- Search Space Explosion: The genetic algorithm explores a combinatorial space of techniques; while effective for the evaluated models, scaling to dozens of techniques may require more sophisticated search heuristics (e.g., reinforcement learning).
- Quality Metric Scope: The paper focuses on LPIPS/PSNR; other downstream metrics (e.g., CLIP similarity for text‑to‑image) were not evaluated, which could affect perceived quality in some applications.
- Security & Safety: Automatically generated CUDA kernels could inadvertently introduce memory‑safety bugs; future versions should integrate static analysis or sandboxed execution.
Overall, DiffBench and DiffAgent illustrate a compelling direction: using LLMs not just for code completion, but for end‑to‑end system optimization, turning the once‑manual art of diffusion acceleration into an automated, reproducible workflow.
Authors
- Jiajun Jiao
- Haowei Zhu
- Puyuan Yang
- Jianghui Wang
- Ji Liu
- Ziqiong Liu
- Dong Li
- Yuejian Fang
- Junhai Yong
- Bin Wang
- Emad Barsoum
Paper Information
- arXiv ID: 2601.03178v1
- Categories: cs.CV
- Published: January 6, 2026