[Paper] Learning Latency-Aware Orchestration for Parallel Multi-Agent Systems

Published: January 15, 2026 at 11:23 AM EST
4 min read
Source: arXiv - 2601.10560v1

Overview

The paper introduces LAMaS (Latency‑Aware Multi‑agent System), a framework that teaches a controller to orchestrate multiple AI agents in parallel while explicitly minimizing the end‑to‑end latency of the critical execution path. By treating latency as a first‑class supervision signal, the authors show that you can cut the longest‑running chain of operations by up to 46 % without sacrificing—and sometimes even improving—overall task performance.
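
As a concrete illustration of the idea (numbers ours, not from the paper): if three agents each take 0.5 s and run one after another, end‑to‑end latency is 1.5 s; if two of them are independent and can run in parallel before feeding the third, the critical path drops to 1.0 s, a 33 % cut with the same total compute.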

Key Contributions

  • Latency‑aware orchestration: Formulates the multi‑agent coordination problem as a latency‑supervised learning task, targeting the critical path rather than just total compute cost.
  • Parallel execution graph construction: Enables the controller to generate execution topology graphs that schedule agents concurrently, exploiting hardware parallelism.
  • LAMaS framework: A concrete implementation that integrates latency supervision into the neural architecture search (NAS) loop for multi‑agent systems.
  • Empirical gains: Demonstrates 38‑46 % reduction in critical path length across several benchmark MAS tasks, with equal or better accuracy compared to the previous state‑of‑the‑art (SOTA) MAS‑NAS methods.
  • Open‑source release: Provides the full codebase (https://github.com/xishi404/LAMaS) for reproducibility and community extension.

Methodology

  1. Problem formulation – The authors view a multi‑agent system as a directed acyclic graph (DAG) where nodes are individual agents (e.g., language models, planners) and edges represent data dependencies. The critical path is the longest‑duration chain from input to output.
  2. Latency supervision – During training, the framework measures the wall‑clock latency of each candidate DAG on the target hardware. This latency signal is fed back to a controller network that predicts better topologies.
  3. Controller architecture – A reinforcement‑learning (RL) controller samples graph structures (agent selections + wiring) and receives a composite reward that trades task performance (e.g., accuracy or task reward) off against the measured latency; a toy version of this reward is sketched after this list.
  4. Parallel execution engine – The sampled graph is executed on a parallel runtime that schedules independent agents concurrently, respecting data dependencies. This yields the actual latency used for supervision.
  5. Search loop – The controller iteratively refines its policy using policy‑gradient updates, gradually biasing toward graphs that achieve low latency while maintaining high task scores.
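
To make steps 1–3 concrete, here is a minimal Python sketch (ours, not the authors' code) of computing the critical path of a candidate DAG from per‑agent latencies and folding it into a latency‑penalized reward. The graph, the latency values, and the weight `lam` are all illustrative assumptions.

```python
# Minimal sketch (not the authors' implementation): critical path of an
# execution DAG plus a latency-penalized composite reward. All names,
# latencies, and the weight `lam` are illustrative assumptions.

from collections import defaultdict

def critical_path(nodes, edges, latency):
    """Longest-duration input-to-output chain in a DAG.

    nodes:   agent ids, assumed to be in topological order
    edges:   (src, dst) data-dependency pairs
    latency: agent id -> measured wall-clock seconds
    """
    preds = defaultdict(list)
    for src, dst in edges:
        preds[dst].append(src)
    finish = {}  # earliest completion time of each agent
    for n in nodes:
        start = max((finish[p] for p in preds[n]), default=0.0)
        finish[n] = start + latency[n]
    return max(finish.values())

def composite_reward(task_score, cp_latency, lam=0.5):
    # One plausible weighting; the paper's exact scheme may differ.
    return task_score - lam * cp_latency

# Two independent agents feed a third: because planner and retriever overlap,
# only the slower of the two counts toward the critical path.
nodes = ["planner", "retriever", "solver"]
edges = [("planner", "solver"), ("retriever", "solver")]
lat = {"planner": 0.30, "retriever": 0.45, "solver": 0.50}
cp = critical_path(nodes, edges, lat)  # max(0.30, 0.45) + 0.50 = 0.95
print(cp, composite_reward(0.82, cp))  # 0.95  0.345
```

This scalar is what the RL controller maximizes in the search loop, so a topology that overlaps the planner and retriever scores strictly higher than a sequential one with the same accuracy.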

The overall pipeline is similar to existing NAS approaches but replaces the usual FLOPs or parameter count proxy with real‑world latency, and it explicitly models parallelism rather than assuming a sequential execution order.
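
The execution side is easy to ground in code as well. The sketch below is a deliberately simplified, level‑synchronous stand‑in for the parallel execution engine (the actual LAMaS runtime is presumably more sophisticated): every agent whose dependencies are satisfied runs concurrently, and the elapsed wall‑clock time is exactly the latency signal the controller trains against. Agent names and delays are made up for illustration.

```python
# Simplified dependency-respecting parallel runtime; an illustration, not the
# LAMaS engine. Assumes the graph is acyclic and each agent is a zero-argument
# async callable.

import asyncio
import time

async def run_graph(agents, edges):
    deps = {name: {s for s, d in edges if d == name} for name in agents}
    finished, results = set(), {}
    start = time.perf_counter()
    while len(finished) < len(agents):
        # Every agent whose dependencies are all done runs in this wave.
        ready = [n for n in agents if n not in finished and deps[n] <= finished]
        if not ready:
            raise ValueError("dependency cycle detected")
        outputs = await asyncio.gather(*(agents[n]() for n in ready))
        results.update(zip(ready, outputs))
        finished.update(ready)
    return results, time.perf_counter() - start  # wall-clock latency signal

async def mock_agent(delay):
    await asyncio.sleep(delay)  # stand-in for a real model call
    return delay

agents = {
    "planner":   lambda: mock_agent(0.30),
    "retriever": lambda: mock_agent(0.45),
    "solver":    lambda: mock_agent(0.50),
}
edges = [("planner", "solver"), ("retriever", "solver")]

_, latency = asyncio.run(run_graph(agents, edges))
print(f"{latency:.2f} s")  # ~0.95 s; run sequentially it would take 1.25 s
```

The 0.30 s saved here comes purely from overlapping the planner and retriever; that overlap is exactly the behavior the latency term in the reward pushes the controller to discover.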

Results & Findings

| Benchmark | Baseline (SOTA MAS‑NAS) | LAMaS | Critical‑Path Reduction | Task Performance |
|---|---|---|---|---|
| Multi‑turn Dialogue | 1.23 s | 0.71 s | 42 % | +1.2 % Exact Match |
| Collaborative Navigation | 2.05 s | 1.12 s | 45 % | ±0 % Success Rate |
| Multi‑agent Reasoning (HotpotQA) | 1.78 s | 0.96 s | 46 % | +0.4 % F1 |
  • Latency gains are consistent across diverse tasks, confirming that the controller learns to place latency‑heavy agents later in the graph or to split them into parallel branches.
  • Task metrics are either unchanged or slightly improved, indicating that latency optimization does not force a trade‑off with reasoning quality.
  • Ablation studies show that removing latency supervision or forcing sequential execution erodes the gains, underscoring the importance of both components.

Practical Implications

  • Faster user‑facing AI services: Chatbots, virtual assistants, or collaborative bots can respond noticeably quicker, which is critical for real‑time user experiences.
  • Cost‑effective scaling: By shrinking the critical path, you can achieve higher throughput on the same hardware, reducing cloud compute bills for large‑scale deployments.
  • Edge and mobile deployment: Latency‑aware orchestration makes it feasible to run multi‑agent pipelines on resource‑constrained devices where parallel cores are available but total compute budget is tight.
  • Developer tooling: The open‑source LAMaS package can be integrated into existing MAS pipelines (e.g., LangChain, AutoGPT) to automatically search for low‑latency orchestrations without hand‑tuning.
  • Hardware‑aware AI design: Encourages a shift from “model‑centric” optimization (accuracy, parameters) toward “system‑centric” design that treats the execution graph as a first‑order artifact.

Limitations & Future Work

  • Hardware dependence: Latency measurements are tied to the specific hardware used during search; transferring the learned orchestration to a different platform may require re‑evaluation.
  • Search cost: The RL‑based search loop incurs non‑trivial compute overhead, especially for very large agent libraries.
  • Static graphs: LAMaS currently produces a fixed orchestration per task; dynamic adaptation at runtime (e.g., based on current load) is not explored.
  • Broader benchmarks: Experiments focus on a handful of standard MAS tasks; applying the method to ultra‑large language model ensembles or heterogeneous sensor‑actuator systems remains open.

Future research directions include hardware‑agnostic latency proxies, online adaptation of the execution graph, and extending the framework to heterogeneous clusters (CPU + GPU + TPU) where parallelism patterns differ.

Authors

  • Xi Shi
  • Mengxin Zheng
  • Qian Lou

Paper Information

  • arXiv ID: 2601.10560v1
  • Categories: cs.MA, cs.AI, cs.CL
  • Published: January 15, 2026
  • PDF: https://arxiv.org/pdf/2601.10560v1