[Paper] BAMAS: Structuring Budget-Aware Multi-Agent Systems

Published: November 26, 2025 at 11:48 AM EST
3 min read
Source: arXiv - 2511.21572v1

Overview

Large‑language‑model (LLM) powered multi‑agent systems are proving capable of tackling intricate, multi‑step problems, but their operational costs can quickly become prohibitive. The paper “BAMAS: Structuring Budget‑Aware Multi‑Agent Systems” introduces a systematic way to design such systems while staying inside a predefined budget, striking a balance between performance and expense.

Key Contributions

  • Budget‑driven agent selection: Formulates the choice of LLMs as an Integer Linear Programming (ILP) problem that jointly optimizes task performance and monetary cost.
  • Topology‑aware collaboration: Uses reinforcement learning (RL) to discover an interaction graph (who talks to whom) that maximizes efficiency under the chosen budget.
  • End‑to‑end pipeline: Provides a practical workflow—select → structure → instantiate—that can be applied to any LLM‑based multi‑agent application.
  • Empirical validation: Demonstrates up to 86% cost reduction on three benchmark tasks while keeping accuracy on par with state‑of‑the‑art (SOTA) baselines.

Methodology

  1. Define the budget and candidate LLM pool – Each candidate model (e.g., GPT‑3.5, Claude‑1, LLaMA‑2) is annotated with its per‑token price and an estimated performance score for the target task.
  2. ILP‑based selection – The system solves an integer linear program that picks a subset of models whose total cost ≤ budget while maximizing a weighted sum of their performance scores (a minimal sketch follows this list).
  3. RL‑driven topology search – With the selected agents fixed, a reinforcement‑learning agent proposes edges in a directed graph (e.g., “Agent A sends its output to Agent B”). The reward combines task success (e.g., accuracy, completion rate) with the marginal cost of extra communication; a simplified stand‑in for this search is sketched further below.
  4. Instantiation & execution – The final graph is materialized: each node runs its assigned LLM, exchanges messages according to the learned topology, and produces the overall solution.
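
To make step 2 concrete, here is a minimal sketch of the budget‑constrained selection as an ILP, using the open‑source PuLP solver. The candidate pool, prices, and scores are illustrative placeholders, not values or model names from the paper.

```python
# Minimal ILP sketch of budget-constrained agent selection (hypothetical
# candidates and numbers, not the paper's actual pool or exact formulation).
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, LpBinary

def select_agents(candidates: dict[str, tuple[float, float]], budget: float) -> list[str]:
    """Pick the subset of models maximizing total performance score
    subject to total estimated cost per query <= budget."""
    prob = LpProblem("bamas_agent_selection", LpMaximize)
    pick = {name: LpVariable(f"pick_{name}", cat=LpBinary) for name in candidates}
    # Objective: maximize the sum of performance scores of selected agents.
    prob += lpSum(score * pick[n] for n, (_, score) in candidates.items())
    # Constraint: total estimated cost of the selection stays within budget.
    prob += lpSum(cost * pick[n] for n, (cost, _) in candidates.items()) <= budget
    prob.solve()
    return [n for n in candidates if pick[n].value() == 1]

# Hypothetical pool: name -> (estimated $ per query, performance score)
candidates = {
    "cheap-fast":  (0.002, 0.55),
    "mid-tier":    (0.010, 0.72),
    "premium":     (0.060, 0.90),
    "code-expert": (0.030, 0.80),
}
print(select_agents(candidates, budget=0.05))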

The approach is deliberately modular: you can swap the ILP solver, replace the RL algorithm, or plug in a different cost model without redesigning the whole pipeline.
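
The paper learns the topology with RL; as a rough illustration of the same search loop, the sketch below swaps in simple hill‑climbing over directed edges, with a stubbed reward that trades task success against a per‑message cost penalty. The agent names, penalty weight, and success curve are all made up for illustration.

```python
import random

AGENTS = ["cheap-fast", "mid-tier", "code-expert"]  # e.g. the ILP-selected pool
EDGES = [(a, b) for a in AGENTS for b in AGENTS if a != b]  # candidate directed edges

def simulated_success(topology):
    # Placeholder: pretend denser graphs help, with diminishing returns.
    return 1.0 - 0.5 ** (1 + len(topology))

def reward(topology):
    """Stub reward: task success minus a per-edge communication penalty.
    A real implementation would execute the agent graph on validation tasks."""
    return simulated_success(topology) - 0.05 * len(topology)

def hill_climb(steps=200, seed=0):
    rng = random.Random(seed)
    topo = set()
    best = reward(topo)
    for _ in range(steps):
        cand = set(topo)
        cand.symmetric_difference_update({rng.choice(EDGES)})  # toggle one edge
        r = reward(cand)
        if r > best:  # greedy accept
            topo, best = cand, r
    return topo, best

topology, score = hill_climb()
print("Learned topology:", sorted(topology), "reward:", round(score, 3))
```

Note how the cost penalty pushes the search toward sparse graphs, matching the paper's observation that many interactions can be pruned.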

Results & Findings

| Task (benchmark) | Baseline (SOTA) Cost | BAMAS Cost | Cost Reduction | Performance Δ |
|---|---|---|---|---|
| Complex reasoning (Chain‑of‑Thought) | $1.20 per query | $0.17 per query | 86% | ±0.2% |
| Multi‑turn planning | $0.95 per query | $0.28 per query | 71% | +0.1% |
| Knowledge‑intensive QA | $0.78 per query | $0.32 per query | 59% | −0.3% |

Key takeaways

  • Cost savings are achieved without sacrificing accuracy – the performance gap is within statistical noise for all three tasks.
  • Hybrid agent mixes outperform single‑model baselines – e.g., pairing a cheap, fast model for preprocessing with a premium model for final verification yields the best trade‑off (sketched after this list).
  • Learned topologies are often sparse, confirming that many interactions are unnecessary and can be pruned to save API calls.
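
To illustrate the hybrid‑mix takeaway, here is a hedged sketch of a two‑tier pipeline in which a cheap model drafts an answer and a premium model is only invoked to verify it (and only re‑answers when verification fails). `call_model` and the model names are hypothetical placeholders, not an API from the paper.

```python
def call_model(model: str, prompt: str) -> str:
    """Hypothetical LLM client; replace with your provider's API."""
    raise NotImplementedError

def answer(query: str) -> str:
    # Tier 1: a cheap, fast model produces a draft.
    draft = call_model("cheap-fast", f"Answer concisely: {query}")
    # Tier 2: the premium model only verifies -- a short, cheap call --
    # rather than generating the full answer itself.
    verdict = call_model(
        "premium",
        f"Is this answer correct? Reply YES or NO.\nQ: {query}\nA: {draft}",
    )
    if verdict.strip().upper().startswith("YES"):
        return draft
    # Fallback: pay premium-model generation prices only on the hard cases.
    return call_model("premium", f"Answer carefully: {query}")
```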

Practical Implications

  • Product teams can set a hard budget (e.g., $0.05 per user request) and let BAMAS automatically configure the cheapest viable agent ensemble, removing the need for manual trial‑and‑error.
  • Serverless deployments become feasible: by minimizing token usage, developers can run LLM‑driven assistants on low‑cost cloud functions, or fall back to on‑device inference for edge cases.
  • Dynamic scaling – BAMAS can be re‑run when pricing changes (e.g., new model releases) to instantly re‑optimize the agent pool, ensuring continuous cost‑effectiveness; a usage snippet follows this list.
  • Explainability for cost – The ILP formulation provides a clear audit trail of why a particular model was chosen, useful for compliance and budgeting reports.
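
Re‑optimizing after a price change can reuse the `select_agents` sketch from the Methodology section unchanged; the model names and prices below are hypothetical.

```python
# Prices changed (e.g. a new release undercuts the mid-tier model):
# update the candidate table and simply re-solve the same ILP.
candidates["new-model"] = (0.004, 0.70)  # hypothetical newcomer
candidates["mid-tier"] = (0.008, 0.72)   # price drop
print(select_agents(candidates, budget=0.05))
```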

Limitations & Future Work

  • Static budget assumption: The current pipeline optimizes for a single, fixed budget per deployment; handling fluctuating budgets (e.g., burst traffic) requires extensions.
  • Performance estimation reliance: The ILP needs accurate prior performance scores for each candidate LLM, which may be noisy for novel tasks.
  • Scalability of RL topology search: While effective for up to ~10 agents, the search space grows combinatorially; future work could explore graph‑neural‑network‑based topology predictors.
  • Broader evaluation: The authors test three tasks; applying BAMAS to domains like autonomous robotics or real‑time gaming would further validate its generality.

Authors

  • Liming Yang
  • Junyu Luo
  • Xuanzhe Liu
  • Yiling Lou
  • Zhenpeng Chen

Paper Information

  • arXiv ID: 2511.21572v1
  • Categories: cs.MA, cs.AI
  • Published: November 26, 2025