[Paper] Aragog: Just-in-Time Model Routing for Scalable Serving of Agentic Workflows
Source: arXiv - 2511.20975v1
Overview
Aragog tackles a pressing problem in today's AI-driven services: how to serve agentic workflows, multi-step pipelines that chain together several LLM calls, without inflating compute cost or latency. By dynamically re-routing each request to the most appropriate model configuration as the workflow runs, Aragog achieves substantially higher throughput and lower latency while keeping answer quality on par with the most expensive static configuration.
Key Contributions
- Just‑in‑time configuration routing: Introduces a runtime‑aware scheduler that can switch model choices mid‑workflow based on current system load.
- Two‑phase decoupling: Splits the problem into (1) a one‑time routing phase that enumerates all accuracy‑preserving configurations, and (2) a lightweight per‑stage scheduler that picks the best configuration on the fly.
- Scalable acceleration techniques: Novel pruning and caching methods keep the routing phase tractable despite the combinatorial explosion of possible model assignments.
- Empirical gains: Demonstrates 50–217 % higher peak throughput and 33–79 % lower median latency across a suite of real‑world workflows, with no measurable drop in output quality.
Methodology
- Workflow Modeling – Each agentic workflow is expressed as a directed acyclic graph (DAG) whose nodes are LLM inference steps and whose edges capture data dependencies; a toy representation is sketched after this list.
- Configuration Space Generation – For every node, a set of candidate LLMs (different sizes, quantizations, or providers) is defined. The system first runs a static analysis to prune configurations that would degrade task accuracy beyond a user‑specified threshold.
- One-Time Routing – Using the pruned space, Aragog builds a compact lookup table of feasible end-to-end configurations. This step runs once per workflow deployment and leverages heuristics (e.g., dominance filtering, illustrated in the second sketch below) to keep the table small.
- Per-Stage Scheduler – At runtime, a lightweight controller monitors CPU/GPU utilization, queue lengths, and latency budgets. Before each node executes, the scheduler selects the cheapest configuration from the lookup table that still satisfies the current resource constraints (see the third sketch below). If system load spikes, the scheduler can swap a high-cost, high-accuracy model for a cheaper alternative on the fly.
- Feedback Loop – Execution metrics flow back into the scheduler's cost model, keeping its predictions calibrated as workloads evolve.
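To make the workflow model and configuration space concrete, here is a minimal Python sketch. The workflow shape, node names, and model names are hypothetical illustrations, not taken from the paper; the point is only that one model choice per node yields a combinatorial space of end-to-end assignments.

```python
import itertools

# Illustrative three-step workflow: each node lists its candidate models,
# and edges capture data dependencies (a simple linear DAG for brevity).
workflow = {
    "nodes": {
        "plan":     ["gpt-large", "gpt-small"],
        "retrieve": ["gpt-small"],
        "answer":   ["gpt-large", "gpt-medium", "gpt-small"],
    },
    "edges": [("plan", "retrieve"), ("retrieve", "answer")],
}

# Naive configuration space: one model choice per node. This grows
# combinatorially with workflow size, which is why pruning is needed.
node_names = list(workflow["nodes"])
candidates = [workflow["nodes"][n] for n in node_names]
config_space = [dict(zip(node_names, choice))
                for choice in itertools.product(*candidates)]
print(len(config_space))  # 2 * 1 * 3 = 6 candidate assignments
```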
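The paper's one-time routing phase is not reproduced here, but its dominance-filtering idea can be sketched as Pareto pruning over profiled configurations: any assignment that another accuracy-feasible assignment matches or beats on both cost and latency can be dropped. The `Config` fields and the `accuracy_floor` threshold below are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class Config:
    """One end-to-end assignment of a model to every workflow node (illustrative)."""
    models: Tuple[str, ...]  # e.g. ("gpt-large", "gpt-small", "gpt-medium")
    accuracy: float          # profiled end-to-end task accuracy
    cost: float              # estimated cost (e.g. GPU-seconds) per request
    latency: float           # estimated end-to-end seconds per request

def dominance_filter(configs: List[Config], accuracy_floor: float) -> List[Config]:
    """Keep only configurations that (1) meet the accuracy threshold and
    (2) are not dominated, i.e. no other feasible config is at least as good
    on both cost and latency and strictly better on one of them."""
    feasible = [c for c in configs if c.accuracy >= accuracy_floor]
    kept = []
    for c in feasible:
        dominated = any(
            o.cost <= c.cost and o.latency <= c.latency
            and (o.cost < c.cost or o.latency < c.latency)
            for o in feasible
        )
        if not dominated:
            kept.append(c)
    # Sort by cost so the runtime scheduler can scan from cheapest upward.
    return sorted(kept, key=lambda c: c.cost)
```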
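Finally, a minimal version of the per-stage selection rule: scan the cost-sorted table from cheapest upward and take the first configuration whose predicted latency, plus the current queueing delay, still fits the remaining budget. The paper's scheduler monitors richer signals (GPU utilization, queue lengths, latency budgets); `queue_delay_s` below is a simplified stand-in for those signals.

```python
def pick_configuration(table, latency_budget_s, queue_delay_s):
    """Choose the cheapest configuration that still meets the latency budget.

    `table` is assumed to be the cost-sorted output of dominance_filter();
    `queue_delay_s` is a simplified stand-in for the load signals a production
    scheduler would track.
    """
    if not table:
        raise ValueError("no feasible configurations for this workflow")
    for cfg in table:  # cheapest first
        if cfg.latency + queue_delay_s <= latency_budget_s:
            return cfg
    # Under heavy load nothing fits; fall back to the fastest option.
    return min(table, key=lambda c: c.latency)
```

Under a load spike `queue_delay_s` grows, so the scan naturally skips the expensive high-accuracy options and lands on a cheaper model, mirroring the on-the-fly swapping behavior described in the scheduler step above.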
Results & Findings
| Metric | Baseline (static config) | Aragog | Improvement |
|---|---|---|---|
| Peak throughput | 1,000 req/s | 1,500–3,170 req/s | +50 % to +217 % |
| Median latency (90 % load) | 1.2 s | 0.26–0.81 s | –33 % to –79 % |
| Task accuracy | Highest‑cost static config | Same as highest‑cost config | ≈ 0 % loss |
- Robustness to load swings: When request rates doubled mid‑execution, Aragog automatically migrated stages to lighter models, preventing queue buildup.
- Model‑agnostic: Experiments spanned OpenAI, Anthropic, and open‑source LLM families, confirming the approach works across heterogeneous back‑ends.
- Negligible overhead: The per-stage scheduler adds less than 2 ms of decision latency per stage, negligible next to the inference-time savings it enables.
Practical Implications
- Cost‑effective scaling: Cloud providers and SaaS platforms can run more concurrent agentic sessions on the same hardware budget, reducing OPEX.
- Dynamic SLAs: Services can guarantee latency targets even under bursty traffic by swapping to cheaper models only when needed, then reverting to high‑accuracy models during idle periods.
- Simplified ops: Engineers no longer need to manually tune per‑workflow model assignments; Aragog’s automated routing handles the heavy lifting.
- Broader adoption of agentic pipelines: Lower latency and cost barriers make it feasible to embed multi‑step LLM reasoning in real‑time products such as code assistants, conversational agents, and autonomous data pipelines.
Limitations & Future Work
- One-time routing cost: The routing phase, while amortized over a deployment's lifetime, can be expensive for extremely large workflows with hundreds of nodes; smarter incremental updates are needed.
- Accuracy estimation: The current pruning relies on offline benchmarks; integrating online quality monitoring could further tighten the trade‑off.
- Hardware heterogeneity: Experiments focused on GPU‑centric clusters; extending the scheduler to CPUs, TPUs, and edge devices remains an open challenge.
- Multi‑tenant fairness: Future versions should consider fairness across tenants when competing for shared model resources.
Aragog demonstrates that “just‑in‑time” model routing is a practical path to scaling sophisticated LLM‑driven applications, offering developers a powerful new lever to balance cost, latency, and quality in production environments.
Authors
- Yinwei Dai
- Zhuofu Chen
- Anand Iyer
- Ravi Netravali
Paper Information
- arXiv ID: 2511.20975v1
- Categories: cs.DC
- Published: November 26, 2025
- PDF: https://arxiv.org/pdf/2511.20975v1