[Paper] Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation

Published: May 8, 2026 at 12:44 PM EDT

Source: arXiv - 2605.07985v1

Overview

The paper introduces Dooly, a profiling framework that lets engineers simulate large‑language‑model (LLM) inference without re‑profiling every hardware‑software configuration. By recognizing that many operation dimensions are fixed by the model itself and only a few are request‑specific, Dooly cuts profiling GPU‑hours by 56 % while keeping predictions within 5 % MAPE for time‑to‑first‑token (TTFT) and 8 % MAPE for time‑per‑output‑token (TPOT).

Key Contributions

Configuration‑agnostic profiling – a single inference pass can serve many model‑hardware‑engine combos.
Redundancy‑aware latency database – Dooly records only new operation shapes, avoiding duplicate measurements.
Taint‑propagation labeling – automatically tags each tensor dimension with its origin (model config vs. request), eliminating manual instrumentation.
Stateful operation isolation – re‑uses the serving engine’s own initialization code to profile attention‑related kernels without extra code changes.
Drop‑in backend – the generated latency regression models can replace the profiling layer of existing simulators with no API changes.
Empirical validation – across 12 models, 2 GPU families, and 3 attention backends, Dooly reduces profiling GPU‑hours by 56 % and achieves ≤ 5 % MAPE for time‑to‑first‑token (TTFT) and ≤ 8 % MAPE for time‑per‑output‑token (TPOT).

Methodology

Single‑pass tracing – Dooly runs a representative inference request once, while a lightweight tracer records every tensor operation.
Taint propagation – each tensor dimension is marked as either model‑derived (e.g., number of heads, hidden size) or request‑derived (e.g., batch size, sequence length). This creates a map from operation shape to its “origin vector.”
Redundancy detection – before profiling an operation, Dooly checks its latency database. If an entry with the same origin vector already exists, the operation is skipped.
Stateful kernel handling – for operations that keep internal state (like attention’s key/value caches), Dooly re‑executes the serving engine’s own initialization routine, capturing the true runtime without hand‑crafted hooks.
Latency modeling – the collected data feed a regression model (e.g., linear or small neural net) that predicts latency as a function of the origin vector. The model is then queried by any simulator to estimate end‑to‑end performance for arbitrary configurations.
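
The taint‑labeling and redundancy‑detection steps above can be illustrated with a minimal sketch. Everything named here (`Dim`, `LatencyDB`, `fake_measure`, the lookup key made of operation name, sizes, and origins) is a hypothetical stand‑in for illustration, not Dooly's actual API, and the "measurement" is a placeholder rather than a real kernel timing.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical origin labels for a tensor dimension.
MODEL = "model"      # fixed by the model config (e.g., hidden size)
REQUEST = "request"  # varies per request (e.g., batch size, sequence length)


@dataclass(frozen=True)
class Dim:
    """One tensor dimension tagged with where its size comes from."""
    size: int
    origin: str


Op = Tuple[str, Tuple[Dim, ...]]


class LatencyDB:
    """Measures an operation only if its (name, labeled shape) key is new."""

    def __init__(self, measure: Callable[[str, Tuple[int, ...]], float]):
        self._measure = measure
        self._entries: Dict[Op, float] = {}
        self.measurements = 0  # how many real measurements were taken

    def record(self, op_name: str, dims: Tuple[Dim, ...]) -> float:
        key = (op_name, tuple(dims))
        if key not in self._entries:
            sizes = tuple(d.size for d in dims)
            self._entries[key] = self._measure(op_name, sizes)
            self.measurements += 1
        return self._entries[key]


def fake_measure(op_name: str, sizes: Tuple[int, ...]) -> float:
    """Placeholder for a real kernel timing: proportional to element count."""
    elements = 1
    for s in sizes:
        elements *= s
    return elements * 1e-6


hidden = Dim(4096, MODEL)       # assumed hidden size
batch_tokens = Dim(8, REQUEST)  # assumed number of tokens in the batch

# The same two operations repeat in every transformer layer.
layer_ops: List[Op] = [
    ("matmul", (batch_tokens, hidden, hidden)),
    ("layernorm", (batch_tokens, hidden)),
]
trace = layer_ops * 3  # three identical layers

db = LatencyDB(fake_measure)
for name, dims in trace:
    db.record(name, dims)

print(f"{len(trace)} traced ops, {db.measurements} measured")
# prints "6 traced ops, 2 measured"
```

In this toy trace, four of the six operations are skipped because an identical labeled shape is already in the database, which is the kind of redundancy the summary attributes to repeated layers.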

The whole pipeline is automated, requiring only a single “profile run” per model family rather than per hardware‑software combo.
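
To make the latency‑modeling step concrete, here is a minimal sketch of fitting a one‑feature linear model and querying it for an unprofiled size. The sample points are made up for illustration, and `fit_line` and `predict` are hypothetical helpers; the paper's regression models may use different features and forms.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope: float, intercept: float, x: float) -> float:
    return slope * x + intercept


# Made-up profiling samples: request-derived token count -> latency in ms.
tokens = [128.0, 256.0, 512.0, 1024.0]
latency_ms = [0.30, 0.50, 0.90, 1.70]

slope, intercept = fit_line(tokens, latency_ms)

# Query the model for a token count that was never profiled.
estimate = predict(slope, intercept, 2048.0)
print(f"estimated latency at 2048 tokens: {estimate:.2f} ms")
```

A simulator would hold one such fitted model per operation and sum the per‑operation predictions along the request's execution path; that composition is outside this sketch.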

Results & Findings

Metric	Result	Interpretation
Profiling GPU‑hours saved	56.4 % reduction (12 models)	Less than half the compute cost for building a latency database
TTFT prediction error	≤ 5 % MAPE	Predictions stay close to measured values for the most latency‑sensitive metric
TPOT prediction error	≤ 8 % MAPE	Good enough for capacity planning and SLA estimation
Platforms tested	NVIDIA A100, RTX 4090	Demonstrates cross‑GPU applicability
Attention backends	FlashAttention‑2, xFormers, native CUDA	Shows robustness to different kernel implementations

The authors also report that the regression models remain stable across minor software updates, meaning the profiling step does not need to be repeated for every driver or library patch.
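
For readers unfamiliar with the error metric in the table, MAPE (mean absolute percentage error) is the average of |predicted − measured| / |measured|, expressed as a percentage. A minimal sketch follows; the TTFT values are invented for illustration and are not results from the paper.

```python
from typing import Sequence


def mape(measured: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - m) / abs(m) for m, p in zip(measured, predicted))
    return 100.0 * total / len(measured)


# Invented TTFT values in milliseconds.
measured_ttft = [100.0, 200.0, 400.0]
simulated_ttft = [105.0, 190.0, 400.0]

error = mape(measured_ttft, simulated_ttft)
print(f"TTFT MAPE: {error:.2f} %")
# per-request errors are 5 %, 5 %, and 0 %, so this prints "TTFT MAPE: 3.33 %"
```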

Practical Implications

Faster configuration search – Teams can now evaluate dozens of hardware‑engine‑model combos in minutes instead of days, accelerating the “right‑size‑your‑LLM” workflow.
Cost‑effective capacity planning – Accurate TTFT/TPOT estimates let cloud operators provision GPU instances with tighter utilization targets, reducing wasted spend.
Simplified tooling integration – Because Dooly plugs into existing simulators as a backend, developers can adopt it without rewriting their performance‑testing pipelines.
Reduced engineering overhead – No need to write custom instrumentation for each new attention kernel or serving stack; Dooly’s taint‑propagation does the heavy lifting automatically.
Enables “what‑if” analysis – Engineers can ask “What if we double the batch size but keep the same model?” and get reliable latency predictions instantly, supporting rapid A/B testing of API changes.

Limitations & Future Work

Scope limited to inference – Training workloads, which involve backward passes and optimizer state, are not covered.
Assumes deterministic kernels – Highly dynamic kernels (e.g., runtime‑generated PTX) may break the redundancy detection logic.
Regression model simplicity – The current models are linear or shallow nets; more complex interactions (e.g., memory bandwidth contention) could benefit from richer models.
Hardware diversity – Validation was performed on two GPU families; extending to TPUs, CPUs, or upcoming accelerator architectures remains an open question.

Future research directions include extending Dooly’s taint‑propagation to training pipelines, incorporating multi‑tenant interference models, and exploring automated model‑selection techniques that directly consume Dooly’s latency predictions.

Authors

Joon Ha Kim
Geon-Woo Kim
Anoop Rachakonda
Daehyeok Kim

Paper Information

arXiv ID: 2605.07985v1
Categories: cs.DC, cs.AI
Published: May 8, 2026
