[Paper] Software-Defined Agentic Serving
Source: arXiv - 2601.03197v1
Overview
The paper introduces Software-Defined Agentic Serving (SDAS), a new framework for running multi‑agent LLM pipelines that treats the serving layer like a software‑defined network. By exposing a programmable control plane, SDAS lets developers dynamically adjust how agents talk to each other based on real‑time load, latency, and task‑specific cues—something traditional static serving stacks can’t do.
Key Contributions
- SDN‑inspired architecture for LLM agents – separates a control plane (policy, routing, scaling) from a data plane (actual LLM inference), enabling on‑the‑fly reconfiguration; a minimal sketch of this split follows the list.
- Declarative intent language – developers can express high‑level goals (e.g., “minimize latency for user‑facing queries” or “prioritize accuracy for compliance checks”) and let the system translate them into concrete serving actions.
- Dynamic communication control – runtime‑aware routing of messages between agents, automatic load‑balancing, and adaptive batching based on current resource utilization.
- Prototype implementation and benchmark suite – built on top of popular LLM serving stacks (vLLM, TGI) and evaluated on realistic multi‑agent workflows (question answering, tool‑augmented reasoning, autonomous code generation).
- Demonstrated performance gains – up to 2.3× reduction in end‑to‑end latency and 30 % lower GPU memory footprint compared with static pipelines.
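To make the control/data‑plane split concrete, here is a minimal Python sketch of a controller pushing routing policy to per‑agent switches. The `Controller`, `AgentSwitch`, and `RoutingPolicy` names are illustrative assumptions; the summary does not give the paper's actual interfaces.

```python
from dataclasses import dataclass

@dataclass
class RoutingPolicy:
    """Control-plane decision pushed down to the data plane."""
    target_replica: str      # which agent replica receives traffic
    max_batch_size: int = 8  # batching hint for the inference backend

class AgentSwitch:
    """Data-plane element in front of one LLM agent: forwards requests
    according to whatever policy the controller last installed."""
    def __init__(self, agent_name: str):
        self.agent_name = agent_name
        self.policy = RoutingPolicy(target_replica=f"{agent_name}-0")

    def install(self, policy: RoutingPolicy) -> None:
        self.policy = policy  # hot-swapped; the agent itself never restarts

class Controller:
    """Control plane: observes metrics, decides, and pushes policies."""
    def __init__(self, switches: dict[str, AgentSwitch]):
        self.switches = switches

    def reconfigure(self, agent_name: str, policy: RoutingPolicy) -> None:
        self.switches[agent_name].install(policy)

# On-the-fly reconfiguration: shift the planner agent to a spare replica.
ctrl = Controller({"planner": AgentSwitch("planner")})
ctrl.reconfigure("planner", RoutingPolicy(target_replica="planner-1",
                                          max_batch_size=16))
```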
Methodology
- System Design – The authors model the serving stack as a graph where nodes are LLM agents (or tool‑calling services) and edges represent communication channels. A controller watches metrics (GPU utilization, queue lengths, request priorities) and pushes policies to switches that sit in front of each agent.
- Policy Language – A lightweight DSL lets engineers declare constraints (e.g., “max‑latency < 200 ms”) and preferences (e.g., “use cheaper model for low‑risk steps”). The controller compiles these into routing tables and batching rules.
- Runtime Adaptation – Using a feedback loop, the controller periodically samples telemetry, runs a lightweight optimizer (linear programming or rule‑based heuristics), and updates the data plane without restarting services; a loop sketch follows this list.
- Evaluation – The prototype runs three representative pipelines: (a) multi‑turn QA with retrieval, (b) tool‑augmented planning (code generation + execution sandbox), and (c) autonomous agents for web‑task automation. Each workload is tested under varying request rates and GPU budgets, comparing SDAS against a baseline static orchestrator.
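The adaptation loop can be approximated as: sample telemetry, evaluate the declared constraints with a cheap heuristic, and push updated rules to the data plane. The sketch below is a hedged reconstruction of the rule‑based variant; the telemetry fields, thresholds, and the `sample`/`push` callables are assumptions, not the paper's API. The `Constraint` mirrors the DSL's "max‑latency < 200 ms" example.

```python
import time
from dataclasses import dataclass

@dataclass
class Telemetry:
    queue_len: int         # pending requests at an agent's switch
    p99_latency_ms: float  # observed tail latency
    gpu_util: float        # 0.0 - 1.0

@dataclass
class Constraint:
    max_latency_ms: float  # e.g., the DSL's "max-latency < 200 ms"

def decide(t: Telemetry, c: Constraint) -> dict:
    """Rule-based stand-in for the paper's lightweight optimizer."""
    rules = {}
    if t.p99_latency_ms > c.max_latency_ms:
        rules["route_to_spare_replica"] = True  # relieve the hot spot
        rules["batch_size"] = 1                 # favor latency over throughput
    elif t.gpu_util < 0.5 and t.queue_len > 4:
        rules["batch_size"] = 16                # soak up idle GPU cycles
    return rules

def control_loop(sample, push, constraint: Constraint, period_s: float = 1.0):
    """Periodically sample telemetry and update the data plane in place;
    services keep running, only their routing/batching rules change."""
    while True:
        rules = decide(sample(), constraint)
        if rules:
            push(rules)
        time.sleep(period_s)
```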
Results & Findings
| Metric | Baseline (static) | SDAS (dynamic) | Improvement |
|---|---|---|---|
| 99th‑percentile latency | 420 ms | 180 ms | 2.3× faster |
| Average GPU memory usage | 12 GB | 8.4 GB | 30 % reduction |
| Throughput (queries / s) | 45 | 62 | ~38 % increase |
| Policy compliance (latency SLA met) | 78 % | 96 % | +18 pp |
Key takeaways
- Adaptive batching cuts idle GPU cycles, especially when request patterns are bursty (see the sketch after this list).
- Dynamic routing prevents hot‑spots; agents that become overloaded are automatically off‑loaded to spare replicas.
- The intent‑driven DSL lets non‑ML engineers tweak serving behavior without touching low‑level code.
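As a concrete illustration of the batching takeaway, a batch size can be derived from queue depth and current GPU headroom, so bursts fill the GPU instead of idling it while saturated periods keep batches small. This is one plausible heuristic, not the paper's exact rule:

```python
def adaptive_batch_size(queue_len: int, gpu_util: float,
                        min_batch: int = 1, max_batch: int = 32) -> int:
    """Grow batches when requests queue up and the GPU has headroom;
    shrink them when the GPU is saturated so latency stays bounded."""
    if gpu_util > 0.9:
        return min_batch  # saturated: keep batches small
    # Scale with backlog, capped by remaining GPU headroom.
    headroom = int(max_batch * (1.0 - gpu_util))
    return max(min_batch, min(queue_len, headroom, max_batch))

# A burst of 20 queued requests on a half-loaded GPU yields a batch of 16.
assert adaptive_batch_size(queue_len=20, gpu_util=0.5) == 16
```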
Practical Implications
- Faster user experiences for AI‑powered products (chatbots, code assistants), because the serving layer reacts immediately to traffic and latency spikes.
- Cost savings: By shrinking memory footprints and improving GPU utilization, cloud‑based LLM services can run more workloads per GPU, lowering operational expenses.
- Simplified ops: Teams can encode business‑level SLAs (e.g., “high‑accuracy for finance queries”) in the DSL, letting the system enforce them automatically—reducing the need for manual tuning.
- Extensibility: The SDAS model can be layered on top of existing serving frameworks (Ray Serve, vLLM, TGI), making it a drop‑in upgrade for organizations already running multi‑agent pipelines.
Limitations & Future Work
- Prototype scope – The current implementation targets single‑node GPU clusters; scaling the control plane across multi‑node data centers remains an open challenge.
- Policy language expressiveness – While the DSL covers common latency/accuracy constraints, more complex QoS policies (e.g., fairness across tenants) need richer semantics.
- Security considerations – Dynamic routing could expose agents to unintended traffic; the authors note the need for robust authentication and sandboxing.
- Future directions – distributed controller design, integration with container orchestration (e.g., Kubernetes), and reinforcement‑learning‑based policy optimization for even finer‑grained adaptation.
Authors
- Saurabh Agarwal
- Marco Laju
- Jayanth Srinivasa
- Myungjin Lee
- Aditya Akella
Paper Information
- arXiv ID: 2601.03197v1
- Categories: cs.DC, cs.MA
- Published: January 6, 2026