[Paper] Software-Defined Agentic Serving
Source: arXiv - 2601.03197v1
Overview
The paper introduces Software-Defined Agentic Serving (SDAS), a new framework for running multi‑agent LLM pipelines that treats the serving layer like a software‑defined network. By exposing a programmable control plane, SDAS lets developers dynamically adjust how agents talk to each other based on real‑time load, latency, and task‑specific cues—something traditional static serving stacks can’t do.
Key Contributions
- SDN‑inspired architecture for LLM agents – separates a control plane (policy, routing, scaling) from a data plane (actual LLM inference), enabling on‑the‑fly reconfiguration; a minimal sketch of this split follows the list.
- Declarative intent language – developers can express high‑level goals (e.g., “minimize latency for user‑facing queries” or “prioritize accuracy for compliance checks”) and let the system translate them into concrete serving actions.
- Dynamic communication control – runtime‑aware routing of messages between agents, automatic load‑balancing, and adaptive batching based on current resource utilization.
- Prototype implementation and benchmark suite – built on top of popular LLM serving stacks (vLLM, TGI) and evaluated on realistic multi‑agent workflows (question answering, tool‑augmented reasoning, autonomous code generation).
- Demonstrated performance gains – up to 2.3× reduction in end‑to‑end latency and 30 % lower GPU memory footprint compared with static pipelines.
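To make the control/data‑plane split concrete, here is a minimal Python sketch of a controller pushing routing policy to per‑agent switches. The `Controller`, `AgentSwitch`, and `RoutingPolicy` names are illustrative assumptions; the summary does not give the paper's actual interfaces.

```python
from dataclasses import dataclass

@dataclass
class RoutingPolicy:
    """Control-plane decision pushed down to the data plane."""
    target_replica: str      # which agent replica receives traffic
    max_batch_size: int = 8  # batching hint for the inference backend

class AgentSwitch:
    """Data-plane element in front of one LLM agent: forwards requests
    according to whatever policy the controller last installed."""
    def __init__(self, agent_name: str):
        self.agent_name = agent_name
        self.policy = RoutingPolicy(target_replica=f"{agent_name}-0")

    def install(self, policy: RoutingPolicy) -> None:
        self.policy = policy  # hot-swapped; the agent itself never restarts

class Controller:
    """Control plane: observes metrics, decides, and pushes policies."""
    def __init__(self, switches: dict[str, AgentSwitch]):
        self.switches = switches

    def reconfigure(self, agent_name: str, policy: RoutingPolicy) -> None:
        self.switches[agent_name].install(policy)

# On-the-fly reconfiguration: shift the planner agent to a spare replica.
ctrl = Controller({"planner": AgentSwitch("planner")})
ctrl.reconfigure("planner", RoutingPolicy(target_replica="planner-1",
                                          max_batch_size=16))
```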
Methodology
- System Design – The authors model the serving stack as a graph where nodes are LLM agents (or tool‑calling services) and edges represent communication channels. A controller watches metrics (GPU utilization, queue lengths, request priorities) and pushes policies to switches that sit in front of each agent.
- Policy Language – A lightweight DSL lets engineers declare constraints (e.g., “max‑latency < 200 ms”) and preferences (e.g., “use cheaper model for low‑risk steps”). The controller compiles these into routing tables and batching rules.
- Runtime Adaptation – Using a feedback loop, the controller periodically samples telemetry, runs a lightweight optimizer (linear programming or rule‑based heuristics), and updates the data plane without restarting services; a loop sketch follows this list.
- Evaluation – The prototype runs three representative pipelines: (a) multi‑turn QA with retrieval, (b) tool‑augmented planning (code generation + execution sandbox), and (c) autonomous agents for web‑task automation. Each workload is tested under varying request rates and GPU budgets, comparing SDAS against a baseline static orchestrator.
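The adaptation loop can be approximated as: sample telemetry, evaluate the declared constraints with a cheap heuristic, and push updated rules to the data plane. The sketch below is a hedged reconstruction of the rule‑based variant; the telemetry fields, thresholds, and the `sample`/`push` callables are assumptions, not the paper's API. The `Constraint` mirrors the DSL's "max‑latency < 200 ms" example.

```python
import time
from dataclasses import dataclass

@dataclass
class Telemetry:
    queue_len: int         # pending requests at an agent's switch
    p99_latency_ms: float  # observed tail latency
    gpu_util: float        # 0.0 - 1.0

@dataclass
class Constraint:
    max_latency_ms: float  # e.g., the DSL's "max-latency < 200 ms"

def decide(t: Telemetry, c: Constraint) -> dict:
    """Rule-based stand-in for the paper's lightweight optimizer."""
    rules = {}
    if t.p99_latency_ms > c.max_latency_ms:
        rules["route_to_spare_replica"] = True  # relieve the hot spot
        rules["batch_size"] = 1                 # favor latency over throughput
    elif t.gpu_util < 0.5 and t.queue_len > 4:
        rules["batch_size"] = 16                # soak up idle GPU cycles
    return rules

def control_loop(sample, push, constraint: Constraint, period_s: float = 1.0):
    """Periodically sample telemetry and update the data plane in place;
    services keep running, only their routing/batching rules change."""
    while True:
        rules = decide(sample(), constraint)
        if rules:
            push(rules)
        time.sleep(period_s)
```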
Results & Findings
| Metric | Baseline (static) | SDAS (dynamic) | Improvement |
|---|---|---|---|
| 99th‑percentile latency | 420 ms | 180 ms | 2.3× faster |
| Average GPU memory usage | 12 GB | 8.4 GB | 30 % reduction |
| Throughput (queries / s) | 45 | 62 | ~38 % increase |
| Policy compliance (latency SLA met) | 78 % | 96 % | +18 pp |
Key takeaways
- Adaptive batching cuts idle GPU cycles, especially when request patterns are bursty (see the sketch after this list).
- Dynamic routing prevents hot‑spots; agents that become overloaded are automatically off‑loaded to spare replicas.
- The intent‑driven DSL lets non‑ML engineers tweak serving behavior without touching low‑level code.
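As a concrete illustration of the batching takeaway, a batch size can be derived from queue depth and current GPU headroom, so bursts fill the GPU instead of idling it while saturated periods keep batches small. This is one plausible heuristic, not the paper's exact rule:

```python
def adaptive_batch_size(queue_len: int, gpu_util: float,
                        min_batch: int = 1, max_batch: int = 32) -> int:
    """Grow batches when requests queue up and the GPU has headroom;
    shrink them when the GPU is saturated so latency stays bounded."""
    if gpu_util > 0.9:
        return min_batch  # saturated: keep batches small
    # Scale with backlog, capped by remaining GPU headroom.
    headroom = int(max_batch * (1.0 - gpu_util))
    return max(min_batch, min(queue_len, headroom, max_batch))

# A burst of 20 queued requests on a half-loaded GPU yields a batch of 16.
assert adaptive_batch_size(queue_len=20, gpu_util=0.5) == 16
```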
Practical Implications
- Faster user experiences for AI‑powered products (chatbots, code assistants), because the serving layer reacts immediately to traffic and latency spikes.
- Cost savings: By shrinking memory footprints and improving GPU utilization, cloud‑based LLM services can run more workloads per GPU, lowering operational expenses.
- Simplified ops: Teams can encode business‑level SLAs (e.g., “high‑accuracy for finance queries”) in the DSL, letting the system enforce them automatically—reducing the need for manual tuning.
- Extensibility: The SDAS model can be layered on top of existing serving frameworks (Ray Serve, vLLM, TGI), making it a drop‑in upgrade for organizations already running multi‑agent pipelines.
Limitations & Future Work
- Prototype scope – The current implementation targets single‑node GPU clusters; scaling the control plane across multi‑node data centers remains an open challenge.
- Policy language expressiveness – While the DSL covers common latency/accuracy constraints, more complex QoS policies (e.g., fairness across tenants) need richer semantics.
- Security considerations – Dynamic routing could expose agents to unintended traffic; the authors note the need for robust authentication and sandboxing.
- Future directions – distributed controller design, integration with container orchestration (e.g., Kubernetes), and reinforcement‑learning‑based policy optimization for even finer‑grained adaptation.
Authors
- Saurabh Agarwal
- Marco Laju
- Jayanth Srinivasa
- Myungjin Lee
- Aditya Akella
Paper Information
- arXiv ID: 2601.03197v1
- Categories: cs.DC, cs.MA
- Published: January 6, 2026