[Paper] MASPO: Joint Prompt Optimization for LLM-based Multi-Agent Systems
Source: arXiv - 2605.06623v1
Overview
Large language model (LLM)‑driven multi‑agent systems (MAS) are emerging as a way to decompose complex problems—think automated customer‑support pipelines, data‑analysis workflows, or game‑playing bots—into coordinated subtasks handled by specialized agents. The paper “MASPO: Joint Prompt Optimization for LLM‑based Multi‑Agent Systems” tackles a surprisingly sticky issue: the prompts that steer each agent are usually tuned in isolation, which can cause the whole system to drift away from the desired global outcome. MASPO proposes a unified, data‑driven method to iteratively polish prompts jointly, so that every agent’s instruction set is aligned with the end‑to‑end goal.
Key Contributions
- Joint Prompt Evaluation: Introduces a metric that scores a prompt not only on its immediate correctness but also on how well it sets up the next agent for success, eliminating the need for hand‑crafted ground‑truth labels.
- MASPO Framework: A closed‑loop system that automatically refines prompts across all agents in a MAS through repeated evaluation and update cycles.
- Evolutionary Beam Search: A scalable, data‑efficient search algorithm that explores the massive combinatorial space of multi‑agent prompts without exhaustive enumeration.
- Empirical Validation: Demonstrates consistent gains (≈ 2.9 % absolute accuracy improvement) over leading prompt‑optimization baselines on six heterogeneous collaborative tasks.
- Open‑Source Release: Provides a ready‑to‑use implementation (https://github.com/wangzx1219/MASPO) for the community to plug into existing LLM‑based pipelines.
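This summary does not state the formula behind the joint evaluation metric. As a minimal sketch, assume the score is a linear blend of a local term and a successor-impact term, both normalized to [0, 1]. The function name `joint_score` and the weight `alpha` are hypothetical and are not taken from the paper or its repository.

```python
def joint_score(local: float, successor: float, alpha: float = 0.5) -> float:
    """Blend an agent's local score with a successor-impact score.

    Assumption: both inputs lie in [0, 1] and the blend is linear;
    the paper may define the metric differently.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * local + alpha * successor


# A prompt that looks good locally but leaves the next agent poorly set up...
myopic = joint_score(local=0.9, successor=0.4)
# ...versus one that is slightly worse locally but helps the successor.
cooperative = joint_score(local=0.8, successor=0.8)
print(myopic < cooperative)
```

Under this assumed form, `alpha = 0` reduces to scoring each prompt in isolation, which is the setup the paper argues against, while larger values of `alpha` reward prompts whose output is easier for the next agent to use.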
Methodology
- Prompt Population Initialization – For each agent, MASPO starts with a set of candidate prompts (e.g., variations of role descriptions, task instructions, or context snippets).
- Joint Evaluation Loop
  - Forward Pass: Run the MAS on a validation batch, feeding each agent its current prompt and capturing the downstream output of the successor agent(s).
  - Scoring Function: Compute a joint score that blends the local agent’s performance (e.g., correctness of its immediate response) with a successor‑impact term measuring how the output helps the next agent achieve its sub‑goal.
- Evolutionary Beam Search
  - Selection: Keep the top‑k prompt configurations (the “beam”) based on joint scores.
  - Mutation & Crossover: Generate new prompt variants by swapping phrases, inserting task‑specific keywords, or recombining parts of high‑scoring prompts.
  - Iteration: Repeat the evaluation‑selection‑mutation cycle until convergence or a budget limit is reached.
- Final Deployment: The best‑scoring prompt set for each agent is exported and used in production runs.
The whole pipeline is fully automated; developers only need to supply the task definition, a small validation set, and an initial prompt template per agent.
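The evaluation, selection, and mutation cycle described above can be sketched as a small elitist beam search. This is an illustration under stated assumptions, not the authors' implementation: the names `mutate`, `crossover`, `beam_search`, and `toy_score` are hypothetical, the mutation operator only appends phrases, and `toy_score` stands in for running the real multi-agent system on a validation batch.

```python
import random
from typing import Callable, Dict, List, Tuple

Config = Tuple[str, ...]  # one prompt per agent, in pipeline order


def mutate(config: Config, phrases: List[str], rng: random.Random) -> Config:
    """Append a candidate phrase to one randomly chosen agent's prompt."""
    i = rng.randrange(len(config))
    changed = list(config)
    changed[i] = f"{changed[i]} {rng.choice(phrases)}"
    return tuple(changed)


def crossover(a: Config, b: Config, rng: random.Random) -> Config:
    """Take each agent's prompt from one of two parent configurations."""
    return tuple(rng.choice(pair) for pair in zip(a, b))


def beam_search(
    initial: Config,
    score_fn: Callable[[Config], float],
    phrases: List[str],
    beam_width: int = 3,
    children_per_round: int = 8,
    max_iters: int = 10,
    seed: int = 0,
) -> Tuple[Config, float]:
    rng = random.Random(seed)
    beam: List[Config] = [initial]
    cache: Dict[Config, float] = {}  # avoid re-scoring identical configurations

    def score(config: Config) -> float:
        if config not in cache:
            cache[config] = score_fn(config)
        return cache[config]

    for _ in range(max_iters):  # budget limit on search rounds
        children: List[Config] = []
        for _ in range(children_per_round):
            parent = rng.choice(beam)
            if len(beam) > 1 and rng.random() < 0.5:
                parent = crossover(parent, rng.choice(beam), rng)
            children.append(mutate(parent, phrases, rng))
        # Selection: parents compete with children, so the best score never drops.
        pool = list(dict.fromkeys(beam + children))  # de-duplicate, keep order
        beam = sorted(pool, key=score, reverse=True)[:beam_width]
    return beam[0], score(beam[0])


def toy_score(config: Config) -> float:
    """Hypothetical stand-in for a joint score computed on validation data."""
    wanted = ("cite sources", "return JSON")  # one target phrase per agent
    hits = sum(phrase in prompt for prompt, phrase in zip(config, wanted))
    length_penalty = 0.01 * sum(len(prompt.split()) for prompt in config)
    return hits - length_penalty


initial = ("You are a researcher.", "You are a formatter.")
phrases = ["cite sources", "return JSON", "be concise"]
best, best_score = beam_search(initial, toy_score, phrases)
print(best, best_score)
```

In a real setting, `score_fn` would dominate the cost because each call means running the whole pipeline on the validation batch, which is why the sketch caches scores and caps the number of rounds.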
Results & Findings
| Task Category | Baseline (state‑of‑the‑art) | MASPO | Δ Accuracy |
|---|---|---|---|
| Collaborative QA | 78.4 % | 81.3 % | +2.9 % |
| Multi‑step Code Generation | 71.2 % | 73.8 % | +2.6 % |
| Planning & Execution (simulated robot) | 84.0 % | 86.5 % | +2.5 % |
| … (4 other tasks) | — | — | — |
Key takeaways
- Consistent Edge: MASPO beats specialized prompt‑tuning methods (e.g., manual prompt engineering, reinforcement‑learning‑based tuning) across all tested domains.
- Efficiency: The evolutionary beam search converges within 10–15 iterations, requiring far fewer LLM calls than exhaustive grid search.
- Robustness: The joint evaluation metric remains stable even when the downstream agents are swapped or re‑ordered, indicating good generalization.
Practical Implications
- Plug‑and‑Play Prompt Tuning: Teams building LLM‑powered assistants can integrate MASPO to automatically harmonize prompts for each micro‑service/agent, reducing manual trial‑and‑error.
- Lower Tuning Cost: By converging in a small number of iterations and avoiding costly RL‑from‑human‑feedback loops, MASPO reduces the API spend of prompt optimization, which matters when scaling to dozens of agents.
- Better End‑to‑End Reliability: Jointly optimized prompts mitigate “pipeline brittleness” where a well‑behaved first agent inadvertently feeds confusing context to the next, a common pain point in multi‑step workflows.
- Cross‑Domain Portability: The framework works with any LLM provider (OpenAI, Anthropic, LLaMA, etc.) as long as you can query the model, making it suitable for both cloud‑based and on‑premise deployments.
Limitations & Future Work
- Prompt Space Heuristics: While evolutionary beam search is efficient, it still relies on handcrafted mutation operators; exotic prompt structures might be missed.
- Scalability to Very Large MAS: Experiments capped at ≤ 5 agents; extending to dozens of interacting agents could inflate the search space and evaluation cost.
- Task‑Specific Scoring: The joint score combines local and successor metrics, but defining the right weighting may require domain knowledge.
- Future Directions: The authors suggest exploring gradient‑based prompt embeddings to guide mutations, integrating human‑in‑the‑loop feedback for safety‑critical domains, and benchmarking on real‑world production pipelines (e.g., multi‑agent customer‑support bots).
Authors
- Zhexuan Wang
- Xuebo Liu
- Li Wang
- Zifei Shan
- Yutong Wang
- Zhenxi Song
- Min Zhang
Paper Information
- arXiv ID: 2605.06623v1
- Categories: cs.AI, cs.CL
- Published: May 7, 2026