[Paper] Aragog: Just-in-Time Model Routing for Scalable Serving of Agentic Workflows
Source: arXiv - 2511.20975v1
Overview
Aragog tackles a pressing problem in today's AI-driven services: how to serve agentic workflows, multi-step pipelines that chain together several LLM calls, without inflating compute cost or latency. By dynamically re-routing each request to the most appropriate model configuration as the workflow runs, Aragog achieves substantially higher throughput and lower latency while keeping answer quality on par with the most expensive static configuration.
Key Contributions
- Just‑in‑time configuration routing: Introduces a runtime‑aware scheduler that can switch model choices mid‑workflow based on current system load.
- Two‑phase decoupling: Splits the problem into (1) a one‑time routing phase that enumerates all accuracy‑preserving configurations, and (2) a lightweight per‑stage scheduler that picks the best configuration on the fly.
- Scalable acceleration techniques: Novel pruning and caching methods keep the routing phase tractable despite the combinatorial explosion of possible model assignments.
- Empirical gains: Demonstrates 50–217 % higher peak throughput and 33–79 % lower median latency across a suite of real‑world workflows, with no measurable drop in output quality.
Methodology
- Workflow Modeling – Each agentic workflow is expressed as a directed acyclic graph (DAG) whose nodes are LLM inference steps and whose edges capture data dependencies; a toy representation is sketched after this list.
- Configuration Space Generation – For every node, a set of candidate LLMs (different sizes, quantizations, or providers) is defined. The system first runs a static analysis to prune configurations that would degrade task accuracy beyond a user‑specified threshold.
- One-Time Routing – Using the pruned space, Aragog builds a compact lookup table of feasible end-to-end configurations. This step runs once per workflow deployment and leverages heuristics (e.g., dominance filtering, illustrated in the second sketch below) to keep the table small.
- Per-Stage Scheduler – At runtime, a lightweight controller monitors CPU/GPU utilization, queue lengths, and latency budgets. Before each node executes, the scheduler selects the cheapest configuration from the lookup table that still satisfies the current resource constraints (see the third sketch below). If system load spikes, the scheduler can swap a high-cost, high-accuracy model for a cheaper alternative on the fly.
- Feedback Loop – Execution metrics flow back into the scheduler's cost model, keeping its predictions calibrated as workloads evolve.
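To make the workflow model and configuration space concrete, here is a minimal Python sketch. The workflow shape, node names, and model names are hypothetical illustrations, not taken from the paper; the point is only that one model choice per node yields a combinatorial space of end-to-end assignments.

```python
import itertools

# Illustrative three-step workflow: each node lists its candidate models,
# and edges capture data dependencies (a simple linear DAG for brevity).
workflow = {
    "nodes": {
        "plan":     ["gpt-large", "gpt-small"],
        "retrieve": ["gpt-small"],
        "answer":   ["gpt-large", "gpt-medium", "gpt-small"],
    },
    "edges": [("plan", "retrieve"), ("retrieve", "answer")],
}

# Naive configuration space: one model choice per node. This grows
# combinatorially with workflow size, which is why pruning is needed.
node_names = list(workflow["nodes"])
candidates = [workflow["nodes"][n] for n in node_names]
config_space = [dict(zip(node_names, choice))
                for choice in itertools.product(*candidates)]
print(len(config_space))  # 2 * 1 * 3 = 6 candidate assignments
```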
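The paper's one-time routing phase is not reproduced here, but its dominance-filtering idea can be sketched as Pareto pruning over profiled configurations: any assignment that another accuracy-feasible assignment matches or beats on both cost and latency can be dropped. The `Config` fields and the `accuracy_floor` threshold below are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class Config:
    """One end-to-end assignment of a model to every workflow node (illustrative)."""
    models: Tuple[str, ...]  # e.g. ("gpt-large", "gpt-small", "gpt-medium")
    accuracy: float          # profiled end-to-end task accuracy
    cost: float              # estimated cost (e.g. GPU-seconds) per request
    latency: float           # estimated end-to-end seconds per request

def dominance_filter(configs: List[Config], accuracy_floor: float) -> List[Config]:
    """Keep only configurations that (1) meet the accuracy threshold and
    (2) are not dominated, i.e. no other feasible config is at least as good
    on both cost and latency and strictly better on one of them."""
    feasible = [c for c in configs if c.accuracy >= accuracy_floor]
    kept = []
    for c in feasible:
        dominated = any(
            o.cost <= c.cost and o.latency <= c.latency
            and (o.cost < c.cost or o.latency < c.latency)
            for o in feasible
        )
        if not dominated:
            kept.append(c)
    # Sort by cost so the runtime scheduler can scan from cheapest upward.
    return sorted(kept, key=lambda c: c.cost)
```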
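Finally, a minimal version of the per-stage selection rule: scan the cost-sorted table from cheapest upward and take the first configuration whose predicted latency, plus the current queueing delay, still fits the remaining budget. The paper's scheduler monitors richer signals (GPU utilization, queue lengths, latency budgets); `queue_delay_s` below is a simplified stand-in for those signals.

```python
def pick_configuration(table, latency_budget_s, queue_delay_s):
    """Choose the cheapest configuration that still meets the latency budget.

    `table` is assumed to be the cost-sorted output of dominance_filter();
    `queue_delay_s` is a simplified stand-in for the load signals a production
    scheduler would track.
    """
    if not table:
        raise ValueError("no feasible configurations for this workflow")
    for cfg in table:  # cheapest first
        if cfg.latency + queue_delay_s <= latency_budget_s:
            return cfg
    # Under heavy load nothing fits; fall back to the fastest option.
    return min(table, key=lambda c: c.latency)
```

Under a load spike `queue_delay_s` grows, so the scan naturally skips the expensive high-accuracy options and lands on a cheaper model, mirroring the on-the-fly swapping behavior described in the scheduler step above.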
Results & Findings
| Metric | Baseline (static config) | Aragog | Improvement |
|---|---|---|---|
| Peak throughput | 1,000 req/s | 1,500–3,170 req/s | +50 % to +217 % |
| Median latency (90 % load) | 1.2 s | 0.26–0.81 s | –33 % to –79 % |
| Task accuracy | Highest‑cost static config | Same as highest‑cost config | ≈ 0 % loss |
- Robustness to load swings: When request rates doubled mid‑execution, Aragog automatically migrated stages to lighter models, preventing queue buildup.
- Model‑agnostic: Experiments spanned OpenAI, Anthropic, and open‑source LLM families, confirming the approach works across heterogeneous back‑ends.
- Negligible overhead: The per-stage scheduler adds less than 2 ms of decision latency per stage, negligible next to the inference-time savings it enables.
Practical Implications
- Cost‑effective scaling: Cloud providers and SaaS platforms can run more concurrent agentic sessions on the same hardware budget, reducing OPEX.
- Dynamic SLAs: Services can guarantee latency targets even under bursty traffic by swapping to cheaper models only when needed, then reverting to high‑accuracy models during idle periods.
- Simplified ops: Engineers no longer need to manually tune per‑workflow model assignments; Aragog’s automated routing handles the heavy lifting.
- Broader adoption of agentic pipelines: Lower latency and cost barriers make it feasible to embed multi‑step LLM reasoning in real‑time products such as code assistants, conversational agents, and autonomous data pipelines.
Limitations & Future Work
- One-time routing cost: The routing phase, while amortized over a deployment's lifetime, can be expensive for extremely large workflows with hundreds of nodes; smarter incremental updates are needed.
- Accuracy estimation: The current pruning relies on offline benchmarks; integrating online quality monitoring could further tighten the trade‑off.
- Hardware heterogeneity: Experiments focused on GPU‑centric clusters; extending the scheduler to CPUs, TPUs, and edge devices remains an open challenge.
- Multi‑tenant fairness: Future versions should consider fairness across tenants when competing for shared model resources.
Aragog demonstrates that “just‑in‑time” model routing is a practical path to scaling sophisticated LLM‑driven applications, offering developers a powerful new lever to balance cost, latency, and quality in production environments.
Authors
- Yinwei Dai
- Zhuofu Chen
- Anand Iyer
- Ravi Netravali
Paper Information
- arXiv ID: 2511.20975v1
- Categories: cs.DC
- Published: November 26, 2025
- PDF: https://arxiv.org/pdf/2511.20975v1