[Paper] MASPO: Joint Prompt Optimization for LLM-based Multi-Agent Systems

Published: May 7, 2026 at 01:35 PM EDT

Source: arXiv - 2605.06623v1

Overview

Multi-agent systems (MAS) driven by large language models (LLMs) decompose complex problems, such as automated customer-support pipelines, data-analysis workflows, or game-playing bots, into coordinated subtasks handled by specialized agents. The paper “MASPO: Joint Prompt Optimization for LLM-based Multi-Agent Systems” addresses a persistent problem: the prompts that steer each agent are usually tuned in isolation, so the system as a whole can drift away from the desired global outcome. MASPO proposes a unified, data-driven method that refines the prompts of all agents jointly and iteratively, so that each agent’s instructions stay aligned with the end-to-end goal.

Key Contributions

Joint Prompt Evaluation: Introduces a metric that scores a prompt not only on its immediate correctness but also on how well it sets up the next agent for success, eliminating the need for hand‑crafted ground‑truth labels.
MASPO Framework: A closed‑loop system that automatically refines prompts across all agents in a MAS through repeated evaluation and update cycles.
Evolutionary Beam Search: A scalable, data‑efficient search algorithm that explores the massive combinatorial space of multi‑agent prompts without exhaustive enumeration.
Empirical Validation: Demonstrates consistent gains (≈ 2.9 % absolute accuracy improvement) over leading prompt‑optimization baselines on six heterogeneous collaborative tasks.
Open‑Source Release: Provides a ready‑to‑use implementation (https://github.com/wangzx1219/MASPO) for the community to plug into existing LLM‑based pipelines.

Methodology

Prompt Population Initialization: For each agent, MASPO starts with a set of candidate prompts (e.g., variations of role descriptions, task instructions, or context snippets).
Joint Evaluation Loop
- Forward Pass: Run the MAS on a validation batch, feeding each agent its current prompt and capturing the downstream output of the successor agent(s).
- Scoring Function: Compute a joint score that blends the local agent’s performance (e.g., correctness of its immediate response) with a successor‑impact term measuring how the output helps the next agent achieve its sub‑goal.
Evolutionary Beam Search
- Selection: Keep the top‑k prompt configurations (the “beam”) based on joint scores.
- Mutation & Crossover: Generate new prompt variants by swapping phrases, inserting task‑specific keywords, or recombining parts of high‑scoring prompts.
- Iteration: Repeat the evaluation‑selection‑mutation cycle until convergence or a budget limit is reached.
Final Deployment: The best‑scoring prompt set for each agent is exported and used in production runs.
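
The scoring step above can be pictured with a minimal sketch. It assumes the joint score is a plain weighted average of a local score and a successor-impact score, both already normalized to [0, 1]. The paper's exact formula is not reproduced in this summary, and `StepResult`, `joint_score`, and `weight` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    """Scores observed for one agent on one validation example (both in [0, 1])."""
    local_score: float      # quality of the agent's own response
    successor_score: float  # how well the next agent did when given that response


def joint_score(results: list[StepResult], weight: float = 0.5) -> float:
    """Blend local quality with successor impact, averaged over a batch.

    `weight` is an assumed mixing coefficient: 1.0 scores only the agent's
    own output, 0.0 scores only the effect on the successor.
    """
    if not results:
        raise ValueError("need at least one result")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    total = sum(
        weight * r.local_score + (1.0 - weight) * r.successor_score
        for r in results
    )
    return total / len(results)


# Made-up scores for two candidate prompts of the same agent, two examples each.
terse = [StepResult(0.9, 0.5), StepResult(0.8, 0.4)]
detailed = [StepResult(0.7, 0.9), StepResult(0.6, 0.8)]

for w in (0.9, 0.5):
    print(w, round(joint_score(terse, w), 3), round(joint_score(detailed, w), 3))
```

With these made-up numbers, a weight of 0.9 ranks the terse prompt first, while a weight of 0.5 ranks the detailed prompt first because it sets up the successor better. The choice of weight can therefore change which prompt survives selection.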

The whole pipeline is fully automated; developers only need to supply the task definition, a small validation set, and an initial prompt template per agent.
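
The selection, mutation, and crossover cycle described in the methodology can be sketched in a few lines of Python. This is a toy under stated assumptions, not the authors' implementation: `toy_score` stands in for the joint score and makes no LLM calls, the mutation operator only appends phrases from a hand-written pool, and every function name and default parameter here is hypothetical.

```python
import random
from typing import Callable

PromptSet = tuple[str, ...]  # one prompt per agent, in pipeline order


def mutate(prompts: PromptSet, phrases: list[str], rng: random.Random) -> PromptSet:
    """Append one phrase from the pool to a randomly chosen agent's prompt."""
    i = rng.randrange(len(prompts))
    changed = list(prompts)
    changed[i] = f"{changed[i]} {rng.choice(phrases)}"
    return tuple(changed)


def crossover(a: PromptSet, b: PromptSet, rng: random.Random) -> PromptSet:
    """Build a child by taking each agent's prompt from one of two parents."""
    return tuple(rng.choice(pair) for pair in zip(a, b))


def evolutionary_beam_search(
    seed: PromptSet,
    score: Callable[[PromptSet], float],
    phrases: list[str],
    beam_width: int = 3,
    children_per_parent: int = 4,
    iterations: int = 10,
    rng_seed: int = 0,
) -> PromptSet:
    rng = random.Random(rng_seed)
    beam = [seed]
    for _ in range(iterations):
        candidates = set(beam)  # parents stay in the pool, so the best score never drops
        for parent in beam:
            for _ in range(children_per_parent):
                child = mutate(parent, phrases, rng)
                if len(beam) > 1 and rng.random() < 0.5:
                    child = crossover(child, rng.choice(beam), rng)
                candidates.add(child)
        # Selection: keep the top-k configurations; ties are broken by prompt text.
        beam = sorted(candidates, key=lambda p: (score(p), p), reverse=True)[:beam_width]
    return beam[0]


def toy_score(prompts: PromptSet) -> float:
    """Stand-in for the joint score: rewards two useful phrases, penalizes length."""
    text = " ".join(prompts)
    hits = sum(kw in text for kw in ("cite sources", "return JSON"))
    return hits - 0.001 * len(text)


seed = ("You are a researcher.", "You are a summarizer.")
pool = ["Always cite sources.", "Please return JSON.", "Be brief."]
best = evolutionary_beam_search(seed, toy_score, pool)
print(best, toy_score(best))
```

Keeping the parents in the candidate pool is a design choice of this sketch: it guarantees the returned prompt set scores at least as well as the seed. In a real run, `score` would execute the MAS on the validation batch, which is where almost all of the cost lies.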

Results & Findings

Task Category	Baseline (state‑of‑the‑art)	MASPO	Δ Accuracy (percentage points)
Collaborative QA	78.4 %	81.3 %	+2.9
Multi‑step Code Generation	71.2 %	73.8 %	+2.6
Planning & Execution (simulated robot)	84.0 %	86.5 %	+2.5
… (remaining tasks)	—	—	—

Key takeaways

Consistent Edge: MASPO beats specialized prompt‑tuning methods (e.g., manual prompt engineering, reinforcement‑learning‑based tuning) across all tested domains.
Efficiency: The evolutionary beam search converges within 10–15 iterations, requiring far fewer LLM calls than exhaustive grid search.
Robustness: The joint evaluation metric remains stable even when the downstream agents are swapped or re‑ordered, indicating good generalization.
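
The efficiency claim can be made concrete with a rough count of how many joint prompt configurations each strategy has to evaluate. The sizes below are illustrative assumptions (10 candidate prompts per agent, a beam of 5, 4 children per parent); only the 15-iteration figure and the 5-agent system size echo numbers mentioned in this summary. Both function names are hypothetical.

```python
def grid_search_configs(prompts_per_agent: int, num_agents: int) -> int:
    """Joint configurations an exhaustive grid search must evaluate."""
    return prompts_per_agent ** num_agents


def beam_search_configs(beam_width: int, children_per_parent: int, iterations: int) -> int:
    """Upper bound on configurations a beam search evaluates.

    One evaluation for the seed, then at most
    beam_width * children_per_parent new children per iteration.
    """
    return 1 + beam_width * children_per_parent * iterations


grid = grid_search_configs(prompts_per_agent=10, num_agents=5)
beam = beam_search_configs(beam_width=5, children_per_parent=4, iterations=15)
print(grid, beam, grid // beam)
```

Each configuration costs roughly the same number of LLM calls to evaluate under either strategy (validation examples times agents), so the ratio of configurations approximates the ratio of evaluation calls. If an LLM is also used to generate the mutations, those calls add to the beam-search side.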

Practical Implications

Plug‑and‑Play Prompt Tuning: Teams building LLM‑powered assistants can integrate MASPO to automatically harmonize prompts for each micro‑service/agent, reducing manual trial‑and‑error.
Reduced Tuning Time & Cost: Because the search converges quickly and avoids costly RL-from-human-feedback loops, MASPO keeps the API bill for prompt tuning low, which matters as the number of agents grows.
Better End‑to‑End Reliability: Jointly optimized prompts mitigate “pipeline brittleness” where a well‑behaved first agent inadvertently feeds confusing context to the next, a common pain point in multi‑step workflows.
Cross-Domain Portability: The framework works with any LLM backend (OpenAI, Anthropic, self-hosted LLaMA models, etc.) as long as you can query the model, making it suitable for both cloud-based and on-premise deployments.
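
One way to picture the portability point is to treat the model as nothing more than a function from prompt text to completion text. The sketch below assumes a strictly sequential pipeline, which may not match every MAS topology, and uses an offline stand-in instead of a real provider SDK. `LLM`, `run_pipeline`, and `echo_llm` are hypothetical names, not part of the MASPO release.

```python
from typing import Callable

# Any function that maps a prompt string to a completion string.
LLM = Callable[[str], str]


def run_pipeline(task: str, agent_prompts: list[str], llm: LLM) -> str:
    """Run agents in sequence; each agent sees its prompt plus the previous output."""
    message = task
    for prompt in agent_prompts:
        message = llm(f"{prompt}\n\nInput:\n{message}")
    return message


def echo_llm(prompt: str) -> str:
    """Offline stand-in for a provider call: returns the last line upper-cased."""
    return prompt.splitlines()[-1].upper()


result = run_pipeline(
    "summarize the report",
    ["You are a planner.", "You are a writer."],
    echo_llm,
)
print(result)
```

Swapping providers then means passing a different callable that wraps the relevant client, while the prompt set being optimized stays unchanged.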

Limitations & Future Work

Prompt Space Heuristics: While evolutionary beam search is efficient, it still relies on handcrafted mutation operators; exotic prompt structures might be missed.
Scalability to Very Large MAS: Experiments were capped at 5 agents; extending to dozens of interacting agents could inflate both the search space and the evaluation cost.
Task‑Specific Scoring: The joint score combines local and successor metrics, but defining the right weighting may require domain knowledge.
Future Directions: The authors suggest exploring gradient‑based prompt embeddings to guide mutations, integrating human‑in‑the‑loop feedback for safety‑critical domains, and benchmarking on real‑world production pipelines (e.g., multi‑agent customer‑support bots).

Authors

Zhexuan Wang
Xuebo Liu
Li Wang
Zifei Shan
Yutong Wang
Zhenxi Song
Min Zhang

Paper Information

arXiv ID: 2605.06623v1
Categories: cs.AI, cs.CL
Published: May 7, 2026
