[Paper] Nalar: An agent serving framework

Published: January 8, 2026 at 11:56 AM EST
4 min read

Source: arXiv - 2601.05109v1

Overview

The paper introduces Nalar, a purpose‑built serving framework for large‑language‑model (LLM)‑driven agents. By cleanly separating what an agent workflow should do from how it is executed, Nalar lets developers write ordinary Python code while the system handles orchestration, state management, and latency‑aware scheduling. The result is a 34 %–74 % reduction in 99th‑percentile latency and roughly 2.7× higher throughput on complex, multi‑step AI applications.

Key Contributions

  • Unified workflow abstraction – agents and tools are expressed as ordinary Python functions; Nalar automatically generates lightweight future stubs that capture dependencies and execution context (the idea is sketched in the example after this list).
  • Managed state layer – logical state is decoupled from its physical placement, enabling safe reuse, migration, and deterministic retries without programmer‑level bookkeeping.
  • Two‑level control architecture – a global policy engine computes high‑level routing and resource policies, while local event‑driven controllers enforce them in real time.
  • Policy‑driven adaptive scheduling – supports dynamic routing, load‑balancing, and resource throttling based on observed latency, workload characteristics, and SLA constraints.
  • Scalable runtime – demonstrated ability to handle >130 K concurrent futures with sub‑500 ms control overhead, and to sustain 80 RPS where existing baselines collapse.
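
To make the unified workflow abstraction concrete, here is a minimal, hypothetical sketch of the future‑stub idea using only the Python standard library. The decorator name, the thread‑pool "scheduler", and the toy tools are illustrative assumptions, not Nalar's actual API.

```python
# Hypothetical sketch of the future-stub idea from the first contribution.
# The decorator, executor, and tool bodies are illustrative stand-ins,
# not Nalar's actual API.
from concurrent.futures import ThreadPoolExecutor, Future

_executor = ThreadPoolExecutor(max_workers=4)  # stand-in for Nalar's scheduler

def agent_fn(fn):
    """Wrap an ordinary Python function so each call returns a Future,
    letting a separate scheduler decide when and where it runs."""
    def stub(*args, **kwargs) -> Future:
        # A real system would also record dependencies and resource needs here.
        return _executor.submit(fn, *args, **kwargs)
    return stub

@agent_fn
def search_tool(query: str) -> str:
    return f"results for {query!r}"   # placeholder tool body

@agent_fn
def summarize(text: str) -> str:
    return text.upper()               # placeholder LLM call

if __name__ == "__main__":
    hits = search_tool("agent serving frameworks")  # returns a Future immediately
    summary = summarize(hits.result())              # dependency resolved by waiting
    print(summary.result())
```

In the real framework the stub would hand dependency resolution to the runtime rather than blocking on `.result()`; the point here is only that the developer writes plain function calls and the system gains scheduling freedom.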

Methodology

  1. Future‑based API – When a developer calls an agent or a tool, Nalar replaces the call with a future object that records the call’s inputs, required resources, and any downstream dependencies. The future is a lightweight placeholder that can be scheduled independently.
  2. State abstraction – All mutable data lives in a managed state store. The store exposes a simple key‑value interface but internally tracks versioning and placement, allowing the runtime to move state across machines or retry operations without corrupting user data. A minimal sketch of this interface follows the list.
  3. Control hierarchy
    • Global policy engine: periodically evaluates system‑wide metrics (e.g., queue lengths, latency histograms) and emits routing and scaling decisions.
    • Local controllers: attached to each worker node, they receive policy updates and enforce them by adjusting task queues, throttling calls, or migrating futures.
  4. Evaluation workloads – The authors built three representative agentic applications (multi‑tool planning, conversational assistants with external APIs, and autonomous data‑pipeline orchestration) and compared Nalar against a vanilla Python‑asyncio baseline and a commercial serverless orchestrator.
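
Here is a minimal sketch of the versioned key‑value interface described in step 2, assuming an optimistic version check on writes; the class and method names are assumptions, not Nalar's managed state API.

```python
# Hypothetical sketch of the versioned key-value interface from step 2.
# Class and method names are assumptions, not Nalar's actual state API.
from dataclasses import dataclass, field
from typing import Any

@dataclass
class ManagedStateStore:
    """Tracks a version per key so the runtime can retry or migrate
    operations without clobbering newer writes."""
    _data: dict = field(default_factory=dict)  # key -> (version, value)

    def get(self, key: str) -> tuple[int, Any]:
        return self._data.get(key, (0, None))

    def put(self, key: str, value: Any, expected_version: int) -> int:
        version, _ = self._data.get(key, (0, None))
        if version != expected_version:
            # A retried or migrated operation lost a race: caller must re-read.
            raise RuntimeError(f"stale write to {key!r}")
        self._data[key] = (version + 1, value)
        return version + 1

store = ManagedStateStore()
version, _ = store.get("conversation:42")
store.put("conversation:42", {"turns": []}, expected_version=version)
```

Decoupling the logical key from where the value physically lives is what allows the runtime to migrate state or deterministically replay a failed step without programmer‑level bookkeeping.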

Results & Findings

  • 99th‑percentile latency: 1.8 s with the baseline vs. 0.5 s–1.2 s with Nalar (a 34 %–74 % reduction)
  • Throughput: 30 RPS with the baseline (which fails above 40 RPS) vs. a stable 80 RPS with Nalar (~2.7× higher)
  • End‑to‑end speedup: 1.8×–2.9× over the baseline on average, up to 2.9×
  • Control overhead per 1 k futures: 1.2 s with the baseline vs. 0.48 s with Nalar (~60 % lower)
  • Maximum concurrent futures handled: ~30 K with the baseline vs. 130 K with Nalar (>4× scaling)

The experiments show that Nalar’s adaptive routing and state management keep long‑running, latency‑sensitive agent pipelines from stalling, even under bursty traffic patterns.

Practical Implications

  • Simplified developer experience – Teams can keep their existing Python codebases; no need to rewrite agents as microservices or embed custom orchestration logic.
  • Cost‑effective scaling – By automatically throttling and migrating work, Nalar reduces over‑provisioning of compute resources, which is valuable for cloud‑native AI services.
  • Robustness for production AI – Deterministic retries and state migration mean fewer “ghost” failures when external APIs (e.g., payment gateways, knowledge bases) become temporarily unavailable.
  • Policy hooks for SLAs – Operators can encode business‑level policies (e.g., prioritize premium users, enforce per‑user rate limits) directly into the global controller without touching application code; a toy example follows this list.
  • Foundation for “agent‑as‑a‑service” platforms – Companies building multi‑agent marketplaces can plug Nalar in to guarantee low tail latency while supporting heterogeneous toolsets (search, DB access, code execution, etc.).
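
As a rough illustration of what such a policy hook could look like, here is a toy priority function; the `Request` fields, the rate limit, and the priority scheme are assumptions for illustration, not an interface defined in the paper.

```python
# Toy sketch of a business-level policy hook of the kind described above.
# The Request fields, rate limit, and priority values are assumptions.
from dataclasses import dataclass

@dataclass
class Request:
    user_tier: str          # e.g. "premium" or "free"
    calls_this_minute: int

PER_USER_RATE_LIMIT = 60    # assumed per-minute cap for non-premium users

def routing_priority(req: Request) -> int:
    """Lower numbers are served first: premium users jump the queue,
    free users over the rate limit are deferred."""
    if req.user_tier == "premium":
        return 0
    if req.calls_this_minute > PER_USER_RATE_LIMIT:
        return 2
    return 1

assert routing_priority(Request("premium", 5)) == 0
assert routing_priority(Request("free", 100)) == 2
```

The key design point is that this logic lives in the global policy engine, not in the agent code itself, so operators can change it without redeploying applications.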

Limitations & Future Work

  • Assumption of Python‑centric workloads – The current stub generation and state APIs are tied to Python; extending to other languages or polyglot environments will require additional engineering.
  • Control loop latency – While sub‑500 ms overhead is modest, ultra‑low‑latency use cases (e.g., high‑frequency trading bots) may still find the control latency a bottleneck.
  • External tool reliability – Nalar can mitigate but not eliminate latency spikes caused by third‑party services; future work could integrate predictive modeling to pre‑emptively reroute calls.
  • Security & multi‑tenant isolation – The paper focuses on performance; robust sandboxing and fine‑grained access control for shared state in multi‑tenant deployments remain open research directions.

Overall, Nalar offers a compelling blueprint for turning complex, LLM‑driven agent pipelines into production‑ready services without sacrificing developer agility. Its blend of future‑based orchestration, managed state, and policy‑driven control could become a cornerstone of next‑generation AI infrastructure.

Authors

  • Marco Laju
  • Donghyun Son
  • Saurabh Agarwal
  • Nitin Kedia
  • Myungjin Lee
  • Jayanth Srinivasa
  • Aditya Akella

Paper Information

  • arXiv ID: 2601.05109v1
  • Categories: cs.DC, cs.MA
  • Published: January 8, 2026