[Paper] MASPO: Joint Prompt Optimization for LLM-based Multi-Agent Systems

Published: May 7, 2026 at 01:35 PM EDT

Source: arXiv - 2605.06623v1

Overview

Large language model (LLM)-driven multi-agent systems (MAS) are emerging as a way to decompose complex problems (automated customer-support pipelines, data-analysis workflows, game-playing bots) into coordinated subtasks handled by specialized agents. The paper “MASPO: Joint Prompt Optimization for LLM-based Multi-Agent Systems” tackles a surprisingly sticky issue: the prompts that steer each agent are usually tuned in isolation, which can cause the whole system to drift away from the desired global outcome. MASPO proposes a unified, data-driven method that iteratively refines all agents' prompts jointly, so that every agent's instructions stay aligned with the end-to-end goal.

Key Contributions

  • Joint Prompt Evaluation: Introduces a metric that scores a prompt not only on its immediate correctness but also on how well it sets up the next agent for success, eliminating the need for hand‑crafted ground‑truth labels.
  • MASPO Framework: A closed‑loop system that automatically refines prompts across all agents in a MAS through repeated evaluation and update cycles.
  • Evolutionary Beam Search: A scalable, data‑efficient search algorithm that explores the massive combinatorial space of multi‑agent prompts without exhaustive enumeration.
  • Empirical Validation: Demonstrates consistent gains (≈ 2.9 % absolute accuracy improvement) over leading prompt‑optimization baselines on six heterogeneous collaborative tasks.
  • Open‑Source Release: Provides a ready‑to‑use implementation (https://github.com/wangzx1219/MASPO) for the community to plug into existing LLM‑based pipelines.

Methodology

  1. Prompt Population Initialization – For each agent, MASPO starts with a set of candidate prompts (e.g., variations of role descriptions, task instructions, or context snippets).
  2. Joint Evaluation Loop
    • Forward Pass: Run the MAS on a validation batch, feeding each agent its current prompt and capturing the downstream output of the successor agent(s).
    • Scoring Function: Compute a joint score that blends the local agent’s performance (e.g., correctness of its immediate response) with a successor‑impact term measuring how the output helps the next agent achieve its sub‑goal.
  3. Evolutionary Beam Search
    • Selection: Keep the top‑k prompt configurations (the “beam”) based on joint scores.
    • Mutation & Crossover: Generate new prompt variants by swapping phrases, inserting task‑specific keywords, or recombining parts of high‑scoring prompts.
    • Iteration: Repeat the evaluation‑selection‑mutation cycle until convergence or a budget limit is reached.
  4. Final Deployment – The best-scoring prompt set for each agent is exported and used in production runs.
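
The scoring function in step 2 can be illustrated with a minimal sketch. It assumes the joint score is a simple weighted average of a local term and a successor-impact term; the summary does not give the paper's exact formula, so the `AgentResult` type, the `joint_score` function, and the `alpha` weight are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass
class AgentResult:
    """Scores observed for one agent on one validation example (hypothetical type)."""
    local_score: float      # quality of the agent's own response, in [0, 1]
    successor_score: float  # quality of the next agent's output given that response, in [0, 1]


def joint_score(results: list[AgentResult], alpha: float = 0.5) -> float:
    """Blend local and successor-impact terms, averaged over a validation batch.

    alpha is an assumed weighting; the paper's actual formula may differ.
    """
    if not results:
        raise ValueError("need at least one result")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    total = sum(
        alpha * r.local_score + (1.0 - alpha) * r.successor_score
        for r in results
    )
    return total / len(results)


# A prompt that looks perfect locally but leaves the successor struggling.
batch = [AgentResult(1.0, 0.5), AgentResult(1.0, 0.0)]
print(joint_score(batch, alpha=1.0))  # local term only: 1.0
print(joint_score(batch, alpha=0.5))  # blended: 0.625
print(joint_score(batch, alpha=0.0))  # successor term only: 0.25
```

The example batch shows why a purely local metric can mislead: the prompt scores 1.0 on its own output, yet the blended score drops once the successor's difficulty is taken into account.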

The whole pipeline is fully automated; developers only need to supply the task definition, a small validation set, and an initial prompt template per agent.
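
As a rough illustration of the evaluate-select-mutate loop in step 3, the sketch below runs a toy evolutionary beam search over prompt sets. It is not the authors' implementation: the mutation and crossover operators, the parameter names (`beam_width`, `children_per_parent`), and especially `toy_score` (an offline stand-in for the LLM-based joint score) are assumptions made so the example runs without any model calls.

```python
import random
from typing import Callable

PromptSet = tuple[str, ...]  # one prompt per agent

KEYWORDS = ["cite sources", "be concise", "return JSON"]  # hypothetical phrases


def toy_score(prompts: PromptSet) -> float:
    """Stand-in for the joint score: rewards keyword coverage, penalizes length."""
    text = " ".join(prompts)
    covered = sum(kw in text for kw in KEYWORDS)
    return covered - 0.001 * len(text)


def mutate(prompts: PromptSet, phrases: list[str], rng: random.Random) -> PromptSet:
    """Append a random phrase to one randomly chosen agent's prompt."""
    i = rng.randrange(len(prompts))
    edited = list(prompts)
    edited[i] = f"{edited[i]} {rng.choice(phrases)}"
    return tuple(edited)


def crossover(a: PromptSet, b: PromptSet, rng: random.Random) -> PromptSet:
    """For each agent, take the prompt from one of the two parents at random."""
    return tuple(rng.choice(pair) for pair in zip(a, b))


def evolutionary_beam_search(
    initial: list[PromptSet],
    score: Callable[[PromptSet], float],
    phrases: list[str],
    beam_width: int = 3,
    children_per_parent: int = 4,
    max_iters: int = 10,
    seed: int = 0,
) -> tuple[PromptSet, float]:
    rng = random.Random(seed)
    cache: dict[PromptSet, float] = {}  # score each configuration only once

    def cached(p: PromptSet) -> float:
        if p not in cache:
            cache[p] = score(p)
        return cache[p]

    def top_k(pool) -> list[PromptSet]:
        # Ties are broken by the prompt text so results are reproducible.
        return sorted(pool, key=lambda p: (-cached(p), p))[:beam_width]

    beam = top_k(set(initial))
    for _ in range(max_iters):  # budget limit
        candidates = set(beam)  # keep parents so the best score never drops
        for parent in beam:
            for _ in range(children_per_parent):
                child = mutate(parent, phrases, rng)
                if rng.random() < 0.5:
                    child = crossover(child, rng.choice(beam), rng)
                candidates.add(child)
        new_beam = top_k(candidates)
        if new_beam == beam:  # beam unchanged: treat as converged
            break
        beam = new_beam
    return beam[0], cached(beam[0])


start = [("You are a researcher.", "You are a writer.")]
best, best_score = evolutionary_beam_search(start, toy_score, KEYWORDS)
print(best, round(best_score, 3))
```

In a real system, `score` would run the whole MAS on the validation set, so caching scores and keeping the beam narrow are what hold the number of LLM calls down.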

Results & Findings

Task Category                          | Baseline (state-of-the-art) | MASPO  | Δ Accuracy
Collaborative QA                       | 78.4 %                      | 81.3 % | +2.9 %
Multi-step Code Generation             | 71.2 %                      | 73.8 % | +2.6 %
Planning & Execution (simulated robot) | 84.0 %                      | 86.5 % | +2.5 %
… (4 other tasks)

Key takeaways

  • Consistent Edge: MASPO beats specialized prompt‑tuning methods (e.g., manual prompt engineering, reinforcement‑learning‑based tuning) across all tested domains.
  • Efficiency: The evolutionary beam search converges within 10–15 iterations, requiring far fewer LLM calls than exhaustive grid search.
  • Robustness: The joint evaluation metric remains stable even when the downstream agents are swapped or re‑ordered, indicating good generalization.
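
To get a feel for the efficiency claim, the back-of-the-envelope calculation below compares how many prompt configurations each strategy has to score. Only the 15 iterations and the 5 agents come from this summary; the candidate count, beam width, children per parent, and validation-set size are hypothetical values chosen for illustration, and the function names are likewise made up.

```python
def grid_search_configs(candidates_per_agent: int, n_agents: int) -> int:
    """Exhaustive search scores every combination of per-agent prompts."""
    return candidates_per_agent ** n_agents


def beam_search_configs(
    initial_pool: int, beam_width: int, children_per_parent: int, iterations: int
) -> int:
    """Upper bound: every generated child is new and is scored exactly once."""
    return initial_pool + iterations * beam_width * children_per_parent


def llm_calls(configs: int, n_agents: int, n_examples: int) -> int:
    """One LLM call per agent per validation example per configuration."""
    return configs * n_agents * n_examples


# Assumed setup: 10 candidate prompts per agent, 5 agents, 50 validation examples.
grid = grid_search_configs(candidates_per_agent=10, n_agents=5)
beam = beam_search_configs(initial_pool=10, beam_width=3,
                           children_per_parent=4, iterations=15)

print(grid, llm_calls(grid, n_agents=5, n_examples=50))  # 100000 25000000
print(beam, llm_calls(beam, n_agents=5, n_examples=50))  # 190 47500
```

Under these assumptions, the beam search scores a few hundred configurations where a full grid would need a hundred thousand, and the gap widens exponentially as agents are added.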

Practical Implications

  • Plug‑and‑Play Prompt Tuning: Teams building LLM‑powered assistants can integrate MASPO to automatically harmonize prompts for each micro‑service/agent, reducing manual trial‑and‑error.
  • Reduced Latency & Cost: Because the search converges quickly and avoids costly RL-from-human-feedback loops, MASPO lowers API usage bills, which matters when scaling to dozens of agents.
  • Better End‑to‑End Reliability: Jointly optimized prompts mitigate “pipeline brittleness” where a well‑behaved first agent inadvertently feeds confusing context to the next, a common pain point in multi‑step workflows.
  • Cross‑Domain Portability: The framework works with any LLM provider (OpenAI, Anthropic, LLaMA, etc.) as long as you can query the model, making it suitable for both cloud‑based and on‑premise deployments.
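
The portability point boils down to treating the model as a plain text-in, text-out function. The sketch below shows one way that could look; the `LLM` type alias, `run_pipeline`, and the offline `echo_llm` stand-in are hypothetical and are not taken from the MASPO repository.

```python
from typing import Callable

# Hypothetical provider-agnostic interface: prompt text in, completion text out.
LLM = Callable[[str], str]


def run_pipeline(agent_prompts: list[str], task_input: str, llm: LLM) -> str:
    """Run agents in sequence; each agent sees its prompt plus the previous output."""
    message = task_input
    for prompt in agent_prompts:
        message = llm(f"{prompt}\n\nInput:\n{message}")
    return message


def echo_llm(text: str) -> str:
    """Offline stand-in for a real provider client: echoes the last line plus '!'."""
    return text.splitlines()[-1] + "!"


print(run_pipeline(["Summarize.", "Translate."], "hello", echo_llm))  # hello!!
```

Swapping providers then means passing a different callable that wraps that provider's client, while the pipeline and any optimizer built on top of it stay unchanged.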

Limitations & Future Work

  • Prompt Space Heuristics: While evolutionary beam search is efficient, it still relies on handcrafted mutation operators; exotic prompt structures might be missed.
  • Scalability to Very Large MAS: Experiments were capped at 5 agents; extending to dozens of interacting agents could inflate the search space and evaluation cost.
  • Task‑Specific Scoring: The joint score combines local and successor metrics, but defining the right weighting may require domain knowledge.
  • Future Directions: The authors suggest exploring gradient‑based prompt embeddings to guide mutations, integrating human‑in‑the‑loop feedback for safety‑critical domains, and benchmarking on real‑world production pipelines (e.g., multi‑agent customer‑support bots).

Authors

  • Zhexuan Wang
  • Xuebo Liu
  • Li Wang
  • Zifei Shan
  • Yutong Wang
  • Zhenxi Song
  • Min Zhang

Paper Information

  • arXiv ID: 2605.06623v1
  • Categories: cs.AI, cs.CL
  • Published: May 7, 2026