[Paper] BAMAS: Structuring Budget-Aware Multi-Agent Systems
Source: arXiv - 2511.21572v1
Overview
Large‑language‑model (LLM) powered multi‑agent systems are proving capable of tackling intricate, multi‑step problems, but their operational costs can quickly become prohibitive. The paper “BAMAS: Structuring Budget‑Aware Multi‑Agent Systems” introduces a systematic way to design such systems while staying inside a predefined budget, striking a balance between performance and expense.
Key Contributions
- Budget‑driven agent selection: Formulates the choice of LLMs as an Integer Linear Programming (ILP) problem that jointly optimizes task performance and monetary cost.
- Topology‑aware collaboration: Uses reinforcement learning (RL) to discover an interaction graph (who talks to whom) that maximizes efficiency under the chosen budget.
- End‑to‑end pipeline: Provides a practical workflow—select → structure → instantiate—that can be applied to any LLM‑based multi‑agent application.
- Empirical validation: Demonstrates up to 86 % cost reduction on three benchmark tasks while keeping accuracy on par with state‑of‑the‑art (SOTA) baselines.
Methodology
- Define the budget and candidate LLM pool – Each candidate model (e.g., GPT‑3.5, Claude‑1, LLaMA‑2) is annotated with its per‑token price and an estimated performance score for the target task.
- ILP‑based selection – The system solves an integer linear program that picks a subset of models whose total cost ≤ budget while maximizing a weighted sum of their performance scores.
- RL‑driven topology search – With the selected agents fixed, a reinforcement‑learning agent proposes edges in a directed graph (e.g., “Agent A sends its output to Agent B”). The reward combines task success (e.g., accuracy, completion rate) and the marginal cost of extra communication.
- Instantiation & execution – The final graph is materialized: each node runs its assigned LLM, exchanges messages according to the learned topology, and produces the overall solution.
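The selection step above can be sketched in a few lines. This is a minimal illustration of the budget-constrained objective, not the paper's implementation: the model names, prices, and scores are made up, and for a small candidate pool the ILP's optimum can be recovered by brute-force enumeration instead of calling a solver.

```python
from itertools import combinations

# Hypothetical candidate pool: (name, cost per query in $, performance score).
# All names and numbers here are illustrative, not taken from the paper.
CANDIDATES = [
    ("cheap-llm", 0.02, 0.61),
    ("mid-llm", 0.10, 0.74),
    ("premium-llm", 0.40, 0.88),
]

def select_agents(candidates, budget):
    """Pick the subset with maximal total performance whose cost fits the budget.

    BAMAS formulates this as an Integer Linear Program; with only a handful
    of candidate models the same objective can be checked exhaustively.
    """
    best, best_score = (), -1.0
    for r in range(1, len(candidates) + 1):
        for subset in combinations(candidates, r):
            cost = sum(c for _, c, _ in subset)
            score = sum(s for _, _, s in subset)
            if cost <= budget and score > best_score:
                best, best_score = subset, score
    return [name for name, _, _ in best], best_score
```

With a $0.15-per-query budget, this picks the cheap and mid-tier models (total cost $0.12) and leaves the premium model out; raising the budget admits all three. For realistic pools, the enumeration would be replaced by an ILP solver such as PuLP or OR-Tools.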
The approach is deliberately modular: you can swap the ILP solver, replace the RL algorithm, or plug in a different cost model without redesigning the whole pipeline.
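To make the topology-search reward concrete, here is a toy sketch under stated assumptions: three hypothetical agent roles, a reward that combines task success with a per-message communication penalty (mirroring the paper's reward design), and exhaustive scoring of every edge set in place of the RL search the paper actually uses. The `success_rate` callback is a stand-in for running the agents on the task.

```python
from itertools import combinations

AGENTS = ["planner", "solver", "verifier"]  # illustrative roles, not from the paper
POSSIBLE_EDGES = [(a, b) for a in AGENTS for b in AGENTS if a != b]

def reward(edges, success_rate, cost_per_message, lam=0.5):
    """Reward in the paper's spirit: task success minus marginal communication cost."""
    return success_rate(edges) - lam * cost_per_message * len(edges)

def best_topology(success_rate, cost_per_message=0.1):
    """Score every directed edge set and keep the best.

    With 3 agents there are only 2**6 edge sets, so exhaustive search is
    feasible; BAMAS uses reinforcement learning to explore this space at scale.
    """
    best_edges, best_r = (), float("-inf")
    for r in range(len(POSSIBLE_EDGES) + 1):
        for edges in combinations(POSSIBLE_EDGES, r):
            rwd = reward(edges, success_rate, cost_per_message)
            if rwd > best_r:
                best_edges, best_r = edges, rwd
    return list(best_edges), best_r
```

Because every edge carries a cost, the optimizer keeps only edges that raise success enough to pay for themselves, which is exactly why the learned topologies reported in the paper come out sparse.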
Results & Findings
| Task (benchmark) | Baseline (SOTA) Cost | BAMAS Cost | Cost Reduction | Performance Δ |
|---|---|---|---|---|
| Complex reasoning (Chain‑of‑Thought) | $1.20 per query | $0.17 per query | 86 % | ±0.2 % |
| Multi‑turn planning | $0.95 per query | $0.28 per query | 71 % | +0.1 % |
| Knowledge‑intensive QA | $0.78 per query | $0.32 per query | 59 % | –0.3 % |
Key takeaways
- Cost savings are achieved without sacrificing accuracy – the performance gap is within statistical noise for all three tasks.
- Hybrid agent mixes outperform single‑model baselines – e.g., pairing a cheap, fast model for preprocessing with a premium model for final verification yields the best trade‑off.
- Learned topologies are often sparse, confirming that many interactions are unnecessary and can be pruned to save API calls.
Practical Implications
- Product teams can set a hard budget (e.g., $0.05 per user request) and let BAMAS automatically configure the cheapest viable agent ensemble, removing the need for manual trial‑and‑error.
- Serverless deployments become more feasible: by minimizing token usage, developers can run LLM‑driven assistants on low‑cost cloud functions, or fall back to on‑device inference for edge cases.
- Dynamic scaling – BAMAS can be re‑run when pricing changes (e.g., new model releases) to instantly re‑optimize the agent pool, ensuring continuous cost‑effectiveness.
- Explainability for cost – The ILP formulation provides a clear audit trail of why a particular model was chosen, useful for compliance and budgeting reports.
Limitations & Future Work
- Static budget assumption: The current pipeline optimizes for a single, fixed budget per deployment; handling fluctuating budgets (e.g., burst traffic) requires extensions.
- Performance estimation reliance: The ILP needs accurate prior performance scores for each candidate LLM, which may be noisy for novel tasks.
- Scalability of RL topology search: While effective for up to ~10 agents, the search space grows combinatorially; future work could explore graph‑neural‑network‑based topology predictors.
- Broader evaluation: The authors test three tasks; applying BAMAS to domains like autonomous robotics or real‑time gaming would further validate its generality.
Authors
- Liming Yang
- Junyu Luo
- Xuanzhe Liu
- Yiling Lou
- Zhenpeng Chen
Paper Information
- arXiv ID: 2511.21572v1
- Categories: cs.MA, cs.AI
- Published: November 26, 2025