[Paper] Nalar: An agent serving framework

Published: January 8, 2026 at 11:56 AM EST
4 min read

Source: arXiv - 2601.05109v1

Overview

The paper introduces Nalar, a purpose‑built serving framework for large‑language‑model (LLM)‑driven agents. By cleanly separating what an agent workflow should do from how it is executed, Nalar lets developers write ordinary Python code while the system handles orchestration, state management, and latency‑aware scheduling. The result is a 34 %–74 % reduction in 99th‑percentile latency and roughly 2.7× higher throughput on complex, multi‑step AI applications.

Key Contributions

  • Unified workflow abstraction – agents and tools are expressed as ordinary Python functions; Nalar automatically generates lightweight future stubs that capture dependencies and execution context (the idea is sketched in the example after this list).
  • Managed state layer – logical state is decoupled from its physical placement, enabling safe reuse, migration, and deterministic retries without programmer‑level bookkeeping.
  • Two‑level control architecture – a global policy engine computes high‑level routing and resource policies, while local event‑driven controllers enforce them in real time.
  • Policy‑driven adaptive scheduling – supports dynamic routing, load‑balancing, and resource throttling based on observed latency, workload characteristics, and SLA constraints.
  • Scalable runtime – demonstrated ability to handle >130 K concurrent futures with sub‑500 ms control overhead, and to sustain 80 RPS where existing baselines collapse.
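
To make the unified workflow abstraction concrete, here is a minimal, hypothetical sketch of the future‑stub idea using only the Python standard library. The decorator name, the thread‑pool "scheduler", and the toy tools are illustrative assumptions, not Nalar's actual API.

```python
# Hypothetical sketch of the future-stub idea from the first contribution.
# The decorator, executor, and tool bodies are illustrative stand-ins,
# not Nalar's actual API.
from concurrent.futures import ThreadPoolExecutor, Future

_executor = ThreadPoolExecutor(max_workers=4)  # stand-in for Nalar's scheduler

def agent_fn(fn):
    """Wrap an ordinary Python function so each call returns a Future,
    letting a separate scheduler decide when and where it runs."""
    def stub(*args, **kwargs) -> Future:
        # A real system would also record dependencies and resource needs here.
        return _executor.submit(fn, *args, **kwargs)
    return stub

@agent_fn
def search_tool(query: str) -> str:
    return f"results for {query!r}"   # placeholder tool body

@agent_fn
def summarize(text: str) -> str:
    return text.upper()               # placeholder LLM call

if __name__ == "__main__":
    hits = search_tool("agent serving frameworks")  # returns a Future immediately
    summary = summarize(hits.result())              # dependency resolved by waiting
    print(summary.result())
```

In the real framework the stub would hand dependency resolution to the runtime rather than blocking on `.result()`; the point here is only that the developer writes plain function calls and the system gains scheduling freedom.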

Methodology

  1. Future‑based API – When a developer calls an agent or a tool, Nalar replaces the call with a future object that records the call’s inputs, required resources, and any downstream dependencies. The future is a lightweight placeholder that can be scheduled independently.
  2. State abstraction – All mutable data lives in a managed state store. The store exposes a simple key‑value interface but internally tracks versioning and placement, allowing the runtime to move state across machines or retry operations without corrupting user data. A minimal sketch of this interface follows the list.
  3. Control hierarchy
    • Global policy engine: periodically evaluates system‑wide metrics (e.g., queue lengths, latency histograms) and emits routing and scaling decisions.
    • Local controllers: attached to each worker node, they receive policy updates and enforce them by adjusting task queues, throttling calls, or migrating futures.
  4. Evaluation workloads – The authors built three representative agentic applications (multi‑tool planning, conversational assistants with external APIs, and autonomous data‑pipeline orchestration) and compared Nalar against a vanilla Python‑asyncio baseline and a commercial serverless orchestrator.
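
Here is a minimal sketch of the versioned key‑value interface described in step 2, assuming an optimistic version check on writes; the class and method names are assumptions, not Nalar's managed state API.

```python
# Hypothetical sketch of the versioned key-value interface from step 2.
# Class and method names are assumptions, not Nalar's actual state API.
from dataclasses import dataclass, field
from typing import Any

@dataclass
class ManagedStateStore:
    """Tracks a version per key so the runtime can retry or migrate
    operations without clobbering newer writes."""
    _data: dict = field(default_factory=dict)  # key -> (version, value)

    def get(self, key: str) -> tuple[int, Any]:
        return self._data.get(key, (0, None))

    def put(self, key: str, value: Any, expected_version: int) -> int:
        version, _ = self._data.get(key, (0, None))
        if version != expected_version:
            # A retried or migrated operation lost a race: caller must re-read.
            raise RuntimeError(f"stale write to {key!r}")
        self._data[key] = (version + 1, value)
        return version + 1

store = ManagedStateStore()
version, _ = store.get("conversation:42")
store.put("conversation:42", {"turns": []}, expected_version=version)
```

Decoupling the logical key from where the value physically lives is what allows the runtime to migrate state or deterministically replay a failed step without programmer‑level bookkeeping.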

Results & Findings

  • 99th‑percentile latency: 1.8 s with the baseline vs. 0.5 s–1.2 s with Nalar (a 34 %–74 % reduction)
  • Throughput: 30 RPS with the baseline (which fails above 40 RPS) vs. a stable 80 RPS with Nalar (~2.7× higher)
  • End‑to‑end speedup: 1.8×–2.9× over the baseline on average, up to 2.9×
  • Control overhead per 1 k futures: 1.2 s with the baseline vs. 0.48 s with Nalar (~60 % lower)
  • Maximum concurrent futures handled: ~30 K with the baseline vs. 130 K with Nalar (>4× scaling)

The experiments show that Nalar’s adaptive routing and state management keep long‑running, latency‑sensitive agent pipelines from stalling, even under bursty traffic patterns.

Practical Implications

  • Simplified developer experience – Teams can keep their existing Python codebases; no need to rewrite agents as microservices or embed custom orchestration logic.
  • Cost‑effective scaling – By automatically throttling and migrating work, Nalar reduces over‑provisioning of compute resources, which is valuable for cloud‑native AI services.
  • Robustness for production AI – Deterministic retries and state migration mean fewer “ghost” failures when external APIs (e.g., payment gateways, knowledge bases) become temporarily unavailable.
  • Policy hooks for SLAs – Operators can encode business‑level policies (e.g., prioritize premium users, enforce per‑user rate limits) directly into the global controller without touching application code; a toy example follows this list.
  • Foundation for “agent‑as‑a‑service” platforms – Companies building multi‑agent marketplaces can plug Nalar in to guarantee low tail latency while supporting heterogeneous toolsets (search, DB access, code execution, etc.).
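
As a rough illustration of what such a policy hook could look like, here is a toy priority function; the `Request` fields, the rate limit, and the priority scheme are assumptions for illustration, not an interface defined in the paper.

```python
# Toy sketch of a business-level policy hook of the kind described above.
# The Request fields, rate limit, and priority values are assumptions.
from dataclasses import dataclass

@dataclass
class Request:
    user_tier: str          # e.g. "premium" or "free"
    calls_this_minute: int

PER_USER_RATE_LIMIT = 60    # assumed per-minute cap for non-premium users

def routing_priority(req: Request) -> int:
    """Lower numbers are served first: premium users jump the queue,
    free users over the rate limit are deferred."""
    if req.user_tier == "premium":
        return 0
    if req.calls_this_minute > PER_USER_RATE_LIMIT:
        return 2
    return 1

assert routing_priority(Request("premium", 5)) == 0
assert routing_priority(Request("free", 100)) == 2
```

The key design point is that this logic lives in the global policy engine, not in the agent code itself, so operators can change it without redeploying applications.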

Limitations & Future Work

  • Assumption of Python‑centric workloads – The current stub generation and state APIs are tied to Python; extending to other languages or polyglot environments will require additional engineering.
  • Control loop latency – While sub‑500 ms overhead is modest, ultra‑low‑latency use cases (e.g., high‑frequency trading bots) may still find the control latency a bottleneck.
  • External tool reliability – Nalar can mitigate but not eliminate latency spikes caused by third‑party services; future work could integrate predictive modeling to pre‑emptively reroute calls.
  • Security & multi‑tenant isolation – The paper focuses on performance; robust sandboxing and fine‑grained access control for shared state in multi‑tenant deployments remain open research directions.

Overall, Nalar offers a compelling blueprint for turning complex, LLM‑driven agent pipelines into production‑ready services without sacrificing developer agility. Its blend of future‑based orchestration, managed state, and policy‑driven control could become a cornerstone of next‑generation AI infrastructure.

Authors

  • Marco Laju
  • Donghyun Son
  • Saurabh Agarwal
  • Nitin Kedia
  • Myungjin Lee
  • Jayanth Srinivasa
  • Aditya Akella

Paper Information

  • arXiv ID: 2601.05109v1
  • Categories: cs.DC, cs.MA
  • Published: January 8, 2026