[Paper] Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation
Source: arXiv - 2605.07985v1
Overview
The paper introduces Dooly, a profiling framework that lets engineers simulate large‑language‑model (LLM) inference without re‑profiling every hardware‑software configuration. Dooly exploits the observation that most operation dimensions are fixed by the model itself and only a few depend on the request. In the authors' evaluation this cuts profiling GPU‑hours by 56 % while keeping prediction error at or below 5 % MAPE for TTFT and 8 % MAPE for TPOT.
Key Contributions
- Configuration‑agnostic profiling – a single inference pass can serve many model‑hardware‑engine combos.
- Redundancy‑aware latency database – Dooly records only new operation shapes, avoiding duplicate measurements.
- Taint‑propagation labeling – automatically tags each tensor dimension with its origin (model config vs. request), eliminating manual instrumentation.
- Stateful operation isolation – re‑uses the serving engine’s own initialization code to profile attention‑related kernels without extra code changes.
- Drop‑in backend – the generated latency regression models can replace the profiling layer of existing simulators with no API changes.
- Empirical validation – across 12 models, 2 GPU families, and 3 attention backends, Dooly reduces profiling GPU‑hours by 56 % and achieves ≤ 5 % mean absolute percentage error (MAPE) for time‑to‑first‑token (TTFT) and ≤ 8 % MAPE for time‑per‑output‑token (TPOT).
Methodology
- Single‑pass tracing – Dooly runs a representative inference request once, while a lightweight tracer records every tensor operation.
- Taint propagation – each tensor dimension is marked as either model‑derived (e.g., number of heads, hidden size) or request‑derived (e.g., batch size, sequence length). This creates a map from operation shape to its “origin vector.”
- Redundancy detection – before profiling an operation, Dooly checks its latency database. If an entry with the same origin vector already exists, the operation is skipped.
- Stateful kernel handling – for operations that keep internal state (like attention’s key/value caches), Dooly re‑executes the serving engine’s own initialization routine, capturing the true runtime without hand‑crafted hooks.
- Latency modeling – the collected measurements are used to fit a regression model (e.g., a linear model or a small neural net) that predicts latency as a function of the origin vector. Any simulator can then query the model to estimate end‑to‑end performance for arbitrary configurations.
The whole pipeline is automated, requiring only a single “profile run” per model family rather than per hardware‑software combo.
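The labeling and redundancy-detection steps above can be illustrated with a minimal sketch. It is not Dooly's implementation: the names `TaintedDim`, `LatencyDB`, and `profile`, the example dimension sizes, and the stand-in timing function are all hypothetical, and the sketch assumes that a database entry is keyed by the operation name plus its labeled dimensions.

```python
from dataclasses import dataclass, field

MODEL, REQUEST = "model", "request"  # hypothetical origin labels


@dataclass(frozen=True)
class TaintedDim:
    """A tensor dimension plus a label saying where its value came from."""
    size: int
    origin: str  # MODEL or REQUEST


@dataclass
class LatencyDB:
    """Keeps one measurement per distinct (operation, labeled dims) key."""
    entries: dict = field(default_factory=dict)
    measured: int = 0  # how many times a real measurement was taken

    def profile(self, op, dims, measure):
        key = (op, tuple((d.size, d.origin) for d in dims))
        if key not in self.entries:      # only unseen keys are measured
            self.entries[key] = measure()
            self.measured += 1
        return self.entries[key]


# Assumed example values: hidden and FFN sizes come from the model config,
# the token count comes from the incoming request.
hidden = TaintedDim(4096, MODEL)
ffn = TaintedDim(11008, MODEL)
tokens = TaintedDim(128, REQUEST)


def fake_measure():
    """Stand-in for a real GPU timing; returns a made-up latency in ms."""
    return 0.42


db = LatencyDB()
db.profile("matmul", (tokens, hidden, ffn), fake_measure)
db.profile("matmul", (tokens, hidden, ffn), fake_measure)  # duplicate, skipped
db.profile("matmul", (TaintedDim(256, REQUEST), hidden, ffn), fake_measure)
print(db.measured)
```

In this sketch the second call hits the database and triggers no measurement, while the third call differs only in a request-derived dimension and is measured once.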
Results & Findings
| Metric | Dooly vs. Baseline | Interpretation |
|---|---|---|
| Profiling GPU‑hours saved | 56.4 % reduction (12 models) | Less than half the compute cost for building a latency database |
| TTFT prediction error | ≤ 5 % MAPE | Predictions track measured latency closely for the most latency‑sensitive metric |
| TPOT prediction error | ≤ 8 % MAPE | Good enough for capacity planning and SLA estimation |
| Platforms tested | NVIDIA A100, RTX 4090 | Demonstrates cross‑GPU applicability |
| Attention backends | FlashAttention‑2, xFormers, native CUDA | Shows robustness to different kernel implementations |
The authors also report that the regression models remain stable across minor software updates, meaning the profiling step does not need to be repeated for every driver or library patch.
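For readers unfamiliar with the error metric in the table, MAPE (mean absolute percentage error) averages the relative gap between predicted and measured latency. The sketch below uses the standard definition; the function name `mape` and the sample latencies are illustrative and are not taken from the paper.

```python
def mape(measured, predicted):
    """Mean absolute percentage error, in percent."""
    if not measured or len(measured) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(m - p) / abs(m) for m, p in zip(measured, predicted))
    return 100.0 * total / len(measured)


# Made-up TTFT values in milliseconds, for illustration only.
measured_ms = [100.0, 200.0, 400.0]
predicted_ms = [105.0, 190.0, 400.0]
error = mape(measured_ms, predicted_ms)
print(f"{error:.2f} %")
```

Here the per-sample errors are 5 %, 5 %, and 0 %, so the result is about 3.33 %, which would fall inside the ≤ 5 % bound reported for TTFT.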
Practical Implications
- Faster configuration search – Teams can now evaluate dozens of hardware‑engine‑model combos in minutes instead of days, accelerating the “right‑size‑your‑LLM” workflow.
- Cost‑effective capacity planning – Accurate TTFT/TPOT estimates let cloud operators provision GPU instances with tighter utilization targets, reducing wasted spend.
- Simplified tooling integration – Because Dooly plugs into existing simulators as a backend, developers can adopt it without rewriting their performance‑testing pipelines.
- Reduced engineering overhead – No need to write custom instrumentation for each new attention kernel or serving stack; Dooly’s taint‑propagation does the heavy lifting automatically.
- Enables “what‑if” analysis – Engineers can ask “What if we double the batch size but keep the same model?” and get a latency prediction without running the workload, supporting rapid comparison of serving configurations.
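A "what‑if" query of this kind reduces to evaluating a fitted latency model at an unprofiled point. The sketch below fits a one-variable least-squares line in plain Python. It assumes latency depends only on batch size, which is a simplification; the feature set Dooly actually uses is not described in this summary, and the helper names and data points are synthetic.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic profile: latency_ms = 10 + 2 * batch_size (made-up numbers).
batch_sizes = [1, 2, 4, 8]
latency_ms = [12.0, 14.0, 18.0, 26.0]

slope, intercept = fit_line(batch_sizes, latency_ms)


def predict(batch_size):
    """Predicted latency in ms for a batch size that was never profiled."""
    return intercept + slope * batch_size


# "What if we double the largest profiled batch size?"
print(predict(16))
```

Because the synthetic points lie exactly on a line, the extrapolated value for batch size 16 is 42 ms. Real kernels are rarely this linear, so predictions far outside the profiled range should be treated with caution.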
Limitations & Future Work
- Scope limited to inference – Training workloads, which involve backward passes and optimizer state, are not covered.
- Assumes deterministic kernels – Highly dynamic kernels (e.g., runtime‑generated PTX) may break the redundancy detection logic.
- Regression model simplicity – The current models are linear or shallow nets; more complex interactions (e.g., memory bandwidth contention) could benefit from richer models.
- Hardware diversity – Validation was performed on two GPU families; extending to TPUs, CPUs, or upcoming accelerator architectures remains an open question.
Future research directions include extending Dooly’s taint‑propagation to training pipelines, incorporating multi‑tenant interference models, and exploring automated model‑selection techniques that directly consume Dooly’s latency predictions.
Authors
- Joon Ha Kim
- Geon-Woo Kim
- Anoop Rachakonda
- Daehyeok Kim
Paper Information
- arXiv ID: 2605.07985v1
- Categories: cs.DC, cs.AI
- Published: May 8, 2026