[Paper] Software-Defined Agentic Serving

Published: January 6, 2026 at 12:22 PM EST
3 min read
Source: arXiv - 2601.03197v1

Overview

The paper introduces Software-Defined Agentic Serving (SDAS), a new framework for running multi‑agent LLM pipelines that treats the serving layer like a software‑defined network. By exposing a programmable control plane, SDAS lets developers dynamically adjust how agents talk to each other based on real‑time load, latency, and task‑specific cues—something traditional static serving stacks can’t do.

Key Contributions

  • SDN‑inspired architecture for LLM agents – separates a control plane (policy, routing, scaling) from a data plane (actual LLM inference), enabling on‑the‑fly reconfiguration.
  • Declarative intent language – developers can express high‑level goals (e.g., “minimize latency for user‑facing queries” or “prioritize accuracy for compliance checks”) and let the system translate them into concrete serving actions (a minimal sketch of such a declaration follows this list).
  • Dynamic communication control – runtime‑aware routing of messages between agents, automatic load‑balancing, and adaptive batching based on current resource utilization.
  • Prototype implementation and benchmark suite – built on top of popular LLM serving stacks (vLLM, TGI) and evaluated on realistic multi‑agent workflows (question answering, tool‑augmented reasoning, autonomous code generation).
  • Demonstrated performance gains – up to 2.3× reduction in end‑to‑end latency and 30 % lower GPU memory footprint compared with static pipelines.
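
To make the intent-driven style concrete, here is a minimal sketch of what such a declaration could look like. The paper's DSL grammar isn't reproduced in this summary, so the structure and field names (`target`, `max_latency_ms`, `prefer`) are illustrative assumptions rendered in Python rather than the authors' actual syntax.

```python
from dataclasses import dataclass

@dataclass
class Intent:
    """Hypothetical intent record: the field names are illustrative
    stand-ins, not the paper's actual DSL grammar."""
    target: str                        # pipeline stage the intent governs
    max_latency_ms: int | None = None  # hard constraint, e.g. an SLA bound
    prefer: str | None = None          # soft preference the optimizer weighs

# Declared once at a high level; the controller is responsible for
# compiling these into concrete routing tables and batching rules.
intents = [
    Intent(target="user_facing_qa", max_latency_ms=200),
    Intent(target="compliance_check", prefer="high_accuracy_model"),
]
```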

Methodology

  1. System Design – The authors model the serving stack as a graph where nodes are LLM agents (or tool‑calling services) and edges represent communication channels. A controller watches metrics (GPU utilization, queue lengths, request priorities) and pushes policies to switches that sit in front of each agent.
  2. Policy Language – A lightweight DSL lets engineers declare constraints (e.g., “max‑latency < 200 ms”) and preferences (e.g., “use cheaper model for low‑risk steps”). The controller compiles these into routing tables and batching rules.
  3. Runtime Adaptation – Using a feedback loop, the controller periodically samples telemetry, runs a lightweight optimizer (linear programming or rule‑based heuristics), and updates the data plane without restarting services (see the control-loop sketch after this list).
  4. Evaluation – The prototype runs three representative pipelines: (a) multi‑turn QA with retrieval, (b) tool‑augmented planning (code generation + execution sandbox), and (c) autonomous agents for web‑task automation. Each workload is tested under varying request rates and GPU budgets, comparing SDAS against a baseline static orchestrator.
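
The sketch below shows the shape of steps 1–3 as a single loop: sample telemetry, run a cheap optimizer, push new routing weights to the data plane. The metric choice, the inverse-queue-depth heuristic, and the `push_routing_table` hook are simplifying assumptions, not the paper's implementation.

```python
import time

def sample_telemetry(replicas):
    """Stub: map each replica to its current queue length. A real
    controller would also scrape GPU utilization and request priorities."""
    return {r: r.queue_length() for r in replicas}

def compile_routing(telemetry):
    """Rule-based heuristic standing in for the paper's LP/heuristic
    optimizer: weight each replica inversely to its queue depth."""
    raw = {r: 1.0 / (q + 1) for r, q in telemetry.items()}
    total = sum(raw.values())
    return {r: w / total for r, w in raw.items()}

def control_loop(replicas, push_routing_table, period_s=1.0):
    """Feedback loop: periodically re-derive routing weights and push
    them to the data plane without restarting any service."""
    while True:
        weights = compile_routing(sample_telemetry(replicas))
        push_routing_table(weights)  # non-disruptive data-plane update
        time.sleep(period_s)
```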

Results & Findings

| Metric | Baseline (static) | SDAS (dynamic) | Improvement |
| --- | --- | --- | --- |
| 99th-percentile latency | 420 ms | 180 ms | 2.3× faster |
| Average GPU memory usage | 12 GB | 8.4 GB | 30% reduction |
| Throughput (queries/s) | 45 | 62 | ~38% increase |
| Policy compliance (latency SLA met) | 78% | 96% | +18 pp |

Key takeaways

  • Adaptive batching cuts idle GPU cycles, especially when request patterns are bursty (a simple sketch follows this list).
  • Dynamic routing prevents hot‑spots; agents that become overloaded are automatically off‑loaded to spare replicas.
  • The intent‑driven DSL lets non‑ML engineers tweak serving behavior without touching low‑level code.
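
As a rough illustration of the adaptive-batching point, a data-plane worker can grow its batch size while a burst keeps the queue deep and shrink it once the queue drains, trading a little per-request latency for higher GPU occupancy. The thresholds and step sizes below are arbitrary assumptions, not values from the paper.

```python
def next_batch_size(current, queue_depth, min_bs=1, max_bs=32,
                    grow_at=16, shrink_at=4):
    """Hypothetical adaptive-batching rule: enlarge batches under bursty
    load, shrink them when the queue drains, so the GPU spends fewer
    cycles idle. All thresholds here are illustrative."""
    if queue_depth >= grow_at:
        return min(current * 2, max_bs)   # burst: amortize more requests
    if queue_depth <= shrink_at:
        return max(current // 2, min_bs)  # sparse traffic: favor latency
    return current

# e.g. next_batch_size(8, queue_depth=20) -> 16
```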

Practical Implications

  • Faster user experiences for AI‑powered products (chatbots, code assistants), because the serving layer can react in real time to traffic bursts and latency spikes.
  • Cost savings: By shrinking memory footprints and improving GPU utilization, cloud‑based LLM services can run more workloads per GPU, lowering operational expenses.
  • Simplified ops: Teams can encode business‑level SLAs (e.g., “high‑accuracy for finance queries”) in the DSL, letting the system enforce them automatically—reducing the need for manual tuning.
  • Extensibility: The SDAS model can be layered on top of existing serving frameworks (Ray Serve, vLLM, TGI), making it a drop‑in upgrade for organizations already running multi‑agent pipelines.
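
As a sketch of what that drop-in layering could look like, the proxy function below forwards a request to whichever backend the controller's current weights favor. It assumes the backends expose an OpenAI-compatible `/v1/completions` HTTP endpoint (as vLLM's server does); the payload shape and the `backends` weight map are illustrative, not a documented SDAS interface.

```python
import json
import random
import urllib.request

def route_completion(prompt, backends, model="served-model"):
    """Forward a completion request to a backend chosen according to
    controller-supplied weights. `backends` maps base URLs to weights,
    e.g. {"http://gpu-0:8000": 0.7, "http://gpu-1:8000": 0.3}."""
    urls = list(backends)
    url = random.choices(urls, weights=[backends[u] for u in urls])[0]
    body = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": 128}
    ).encode()
    req = urllib.request.Request(
        url + "/v1/completions",  # OpenAI-style endpoint (assumed)
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```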

Limitations & Future Work

  • Prototype scope – The current implementation targets single‑node GPU clusters; scaling the control plane across multi‑node data centers remains an open challenge.
  • Policy language expressiveness – While the DSL covers common latency/accuracy constraints, more complex QoS policies (e.g., fairness across tenants) need richer semantics.
  • Security considerations – Dynamic routing could expose agents to unintended traffic; the authors note the need for robust authentication and sandboxing.
  • Future directions include distributed controller design, integration with container orchestration (Kubernetes), and exploring reinforcement‑learning‑based policy optimization for even finer‑grained adaptation.

Authors

  • Saurabh Agarwal
  • Marco Laju
  • Jayanth Srinivasa
  • Myungjin Lee
  • Aditya Akella

Paper Information

  • arXiv ID: 2601.03197v1
  • Categories: cs.DC, cs.MA
  • Published: January 6, 2026
  • PDF: https://arxiv.org/pdf/2601.03197v1