[Paper] Learning Latency-Aware Orchestration for Parallel Multi-Agent Systems

Published: January 15, 2026 at 11:23 AM EST
4 min read
Source: arXiv - 2601.10560v1

Overview

The paper introduces LAMaS (Latency‑Aware Multi‑agent System), a framework that teaches a controller to orchestrate multiple AI agents in parallel while explicitly minimizing the end‑to‑end latency of the critical execution path. By treating latency as a first‑class supervision signal, the authors show that you can cut the longest‑running chain of operations by up to 46 % without sacrificing—and sometimes even improving—overall task performance.
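
As a concrete illustration of the idea (numbers ours, not from the paper): if three agents each take 0.5 s and run one after another, end‑to‑end latency is 1.5 s; if two of them are independent and can run in parallel before feeding the third, the critical path drops to 1.0 s, a 33 % cut with the same total compute.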

Key Contributions

  • Latency‑aware orchestration: Formulates the multi‑agent coordination problem as a latency‑supervised learning task, targeting the critical path rather than just total compute cost.
  • Parallel execution graph construction: Enables the controller to generate execution topology graphs that schedule agents concurrently, exploiting hardware parallelism.
  • LAMaS framework: A concrete implementation that integrates latency supervision into the neural architecture search (NAS) loop for multi‑agent systems.
  • Empirical gains: Demonstrates 38‑46 % reduction in critical path length across several benchmark MAS tasks, with equal or better accuracy compared to the previous state‑of‑the‑art (SOTA) MAS‑NAS methods.
  • Open‑source release: Provides the full codebase (https://github.com/xishi404/LAMaS) for reproducibility and community extension.

Methodology

  1. Problem formulation – The authors view a multi‑agent system as a directed acyclic graph (DAG) where nodes are individual agents (e.g., language models, planners) and edges represent data dependencies. The critical path is the longest‑duration chain from input to output.
  2. Latency supervision – During training, the framework measures the wall‑clock latency of each candidate DAG on the target hardware. This latency signal is fed back to a controller network that predicts better topologies.
  3. Controller architecture – A reinforcement‑learning (RL) controller samples graph structures (agent selections + wiring) and receives a composite reward that trades task performance (e.g., accuracy or task reward) off against the measured latency; a toy version of this reward is sketched after this list.
  4. Parallel execution engine – The sampled graph is executed on a parallel runtime that schedules independent agents concurrently, respecting data dependencies. This yields the actual latency used for supervision.
  5. Search loop – The controller iteratively refines its policy using policy‑gradient updates, gradually biasing toward graphs that achieve low latency while maintaining high task scores.
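
To make steps 1–3 concrete, here is a minimal Python sketch (ours, not the authors' code) of computing the critical path of a candidate DAG from per‑agent latencies and folding it into a latency‑penalized reward. The graph, the latency values, and the weight `lam` are all illustrative assumptions.

```python
# Minimal sketch (not the authors' implementation): critical path of an
# execution DAG plus a latency-penalized composite reward. All names,
# latencies, and the weight `lam` are illustrative assumptions.

from collections import defaultdict

def critical_path(nodes, edges, latency):
    """Longest-duration input-to-output chain in a DAG.

    nodes:   agent ids, assumed to be in topological order
    edges:   (src, dst) data-dependency pairs
    latency: agent id -> measured wall-clock seconds
    """
    preds = defaultdict(list)
    for src, dst in edges:
        preds[dst].append(src)
    finish = {}  # earliest completion time of each agent
    for n in nodes:
        start = max((finish[p] for p in preds[n]), default=0.0)
        finish[n] = start + latency[n]
    return max(finish.values())

def composite_reward(task_score, cp_latency, lam=0.5):
    # One plausible weighting; the paper's exact scheme may differ.
    return task_score - lam * cp_latency

# Two independent agents feed a third: because planner and retriever overlap,
# only the slower of the two counts toward the critical path.
nodes = ["planner", "retriever", "solver"]
edges = [("planner", "solver"), ("retriever", "solver")]
lat = {"planner": 0.30, "retriever": 0.45, "solver": 0.50}
cp = critical_path(nodes, edges, lat)  # max(0.30, 0.45) + 0.50 = 0.95
print(cp, composite_reward(0.82, cp))  # 0.95  0.345
```

This scalar is what the RL controller maximizes in the search loop, so a topology that overlaps the planner and retriever scores strictly higher than a sequential one with the same accuracy.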

The overall pipeline is similar to existing NAS approaches but replaces the usual FLOPs or parameter count proxy with real‑world latency, and it explicitly models parallelism rather than assuming a sequential execution order.
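
The execution side is easy to ground in code as well. The sketch below is a deliberately simplified, level‑synchronous stand‑in for the parallel execution engine (the actual LAMaS runtime is presumably more sophisticated): every agent whose dependencies are satisfied runs concurrently, and the elapsed wall‑clock time is exactly the latency signal the controller trains against. Agent names and delays are made up for illustration.

```python
# Simplified dependency-respecting parallel runtime; an illustration, not the
# LAMaS engine. Assumes the graph is acyclic and each agent is a zero-argument
# async callable.

import asyncio
import time

async def run_graph(agents, edges):
    deps = {name: {s for s, d in edges if d == name} for name in agents}
    finished, results = set(), {}
    start = time.perf_counter()
    while len(finished) < len(agents):
        # Every agent whose dependencies are all done runs in this wave.
        ready = [n for n in agents if n not in finished and deps[n] <= finished]
        if not ready:
            raise ValueError("dependency cycle detected")
        outputs = await asyncio.gather(*(agents[n]() for n in ready))
        results.update(zip(ready, outputs))
        finished.update(ready)
    return results, time.perf_counter() - start  # wall-clock latency signal

async def mock_agent(delay):
    await asyncio.sleep(delay)  # stand-in for a real model call
    return delay

agents = {
    "planner":   lambda: mock_agent(0.30),
    "retriever": lambda: mock_agent(0.45),
    "solver":    lambda: mock_agent(0.50),
}
edges = [("planner", "solver"), ("retriever", "solver")]

_, latency = asyncio.run(run_graph(agents, edges))
print(f"{latency:.2f} s")  # ~0.95 s; run sequentially it would take 1.25 s
```

The 0.30 s saved here comes purely from overlapping the planner and retriever; that overlap is exactly the behavior the latency term in the reward pushes the controller to discover.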

Results & Findings

| Benchmark | Baseline (SOTA MAS‑NAS) | LAMaS | Critical‑Path Reduction | Task Performance |
|---|---|---|---|---|
| Multi‑turn Dialogue | 1.23 s | 0.71 s | 42 % | +1.2 % Exact Match |
| Collaborative Navigation | 2.05 s | 1.12 s | 45 % | ±0 % Success Rate |
| Multi‑agent Reasoning (HotpotQA) | 1.78 s | 0.96 s | 46 % | +0.4 % F1 |
  • Latency gains are consistent across diverse tasks, confirming that the controller learns to place latency‑heavy agents later in the graph or to split them into parallel branches.
  • Task metrics are either unchanged or slightly improved, indicating that latency optimization does not force a trade‑off with reasoning quality.
  • Ablation studies show that removing latency supervision or forcing sequential execution erodes the gains, underscoring the importance of both components.

Practical Implications

  • Faster user‑facing AI services: Chatbots, virtual assistants, or collaborative bots can respond noticeably quicker, which is critical for real‑time user experiences.
  • Cost‑effective scaling: By shrinking the critical path, you can achieve higher throughput on the same hardware, reducing cloud compute bills for large‑scale deployments.
  • Edge and mobile deployment: Latency‑aware orchestration makes it feasible to run multi‑agent pipelines on resource‑constrained devices where parallel cores are available but total compute budget is tight.
  • Developer tooling: The open‑source LAMaS package can be integrated into existing MAS pipelines (e.g., LangChain, AutoGPT) to automatically search for low‑latency orchestrations without hand‑tuning.
  • Hardware‑aware AI design: Encourages a shift from “model‑centric” optimization (accuracy, parameters) toward “system‑centric” design that treats the execution graph as a first‑order artifact.

Limitations & Future Work

  • Hardware dependence: Latency measurements are tied to the specific hardware used during search; transferring the learned orchestration to a different platform may require re‑evaluation.
  • Search cost: The RL‑based search loop incurs non‑trivial compute overhead, especially for very large agent libraries.
  • Static graphs: LAMaS currently produces a fixed orchestration per task; dynamic adaptation at runtime (e.g., based on current load) is not explored.
  • Broader benchmarks: Experiments focus on a handful of standard MAS tasks; applying the method to ultra‑large language model ensembles or heterogeneous sensor‑actuator systems remains open.

Future research directions include hardware‑agnostic latency proxies, online adaptation of the execution graph, and extending the framework to heterogeneous clusters (CPU + GPU + TPU) where parallelism patterns differ.

Authors

  • Xi Shi
  • Mengxin Zheng
  • Qian Lou

Paper Information

  • arXiv ID: 2601.10560v1
  • Categories: cs.MA, cs.AI, cs.CL
  • Published: January 15, 2026
  • PDF: https://arxiv.org/pdf/2601.10560v1