[Paper] BOAD: Discovering Hierarchical Software Engineering Agents via Bandit Optimization
Source: arXiv - 2512.23631v1
Overview
The paper introduces BOAD (Bandit Optimization for Agent Design), a framework that automatically builds hierarchical, multi‑agent software‑engineering assistants. By treating each possible sub‑agent (e.g., bug localizer, code editor, test validator) as an arm in a multi‑armed bandit, BOAD learns how to compose a team of specialists that outperforms monolithic LLM agents on real‑world, long‑horizon coding tasks.
Key Contributions
- Hierarchical Agent Discovery: Formulates the problem of finding effective sub‑agent hierarchies as a multi‑armed bandit (MAB) optimization, enabling scalable exploration of combinatorial design spaces.
- Credit Assignment Mechanism: Introduces a reward signal that measures each sub‑agent’s “helpfulness” when collaborating, solving the classic credit‑assignment challenge in multi‑agent teams.
- BOAD Framework: Provides an end‑to‑end pipeline (candidate generation → bandit‑driven selection → joint evaluation) that works under strict evaluation budgets.
- Empirical Gains: Demonstrates strong performance on SWE‑bench‑Verified and near‑top results on SWE‑bench‑Live, beating strong baselines including single‑agent LLMs and manually crafted multi‑agent systems.
- Open‑Source Release: Shares code, prompts, and evaluation scripts, facilitating reproducibility and community extensions.
Methodology
- Candidate Sub‑Agent Pool:
  - Start with a set of prompt templates and tool‑wrappers (e.g., “find the buggy file”, “apply a diff”, “run unit tests”).
  - Each template + LLM pair constitutes a candidate arm.
- Bandit Formulation:
  - Arms: individual candidate sub‑agents.
  - Pull: assemble a hierarchy (orchestrator + selected sub‑agents) and run it on a SWE task.
  - Reward: a composite score combining task success (e.g., passing verification tests) and efficiency (e.g., number of LLM calls).
- Exploration‑Exploitation Loop:
  - Use a contextual MAB algorithm (e.g., Thompson Sampling) to balance trying new sub‑agents against exploiting known good ones; see the sketch after this list.
  - After each evaluation, update the posterior belief about each arm’s utility, which directly shapes future hierarchy proposals.
- Orchestrator Design:
  - A lightweight controller decides the execution order (localize → edit → validate) based on the current hierarchy.
  - The orchestrator itself can be a simple rule‑based script; the novelty lies in the automatically discovered sub‑agents it coordinates.
- Evaluation Budget:
  - The bandit runs under a fixed total number of task evaluations (e.g., a few thousand), reflecting realistic constraints on LLM API usage.
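To make the selection loop concrete, below is a minimal sketch of the bandit‑driven discovery step. It assumes a simplified, non‑contextual Beta–Bernoulli Thompson Sampling variant, a hypothetical composite‑reward weighting, and a fixed localize → edit → validate order; the names `Arm`, `composite_reward`, `run_hierarchy`, and the pool entries are illustrative placeholders, not the paper's implementation, which uses a contextual bandit and a dedicated credit‑assignment signal.

```python
import random

class Arm:
    """One candidate sub-agent: a prompt template bound to an LLM, tracked
    with a Beta(successes+1, failures+1) posterior over its helpfulness."""
    def __init__(self, name):
        self.name = name
        self.successes = 0.0  # soft counts, weighted by the composite reward
        self.failures = 0.0

    def sample(self):
        # Thompson Sampling: draw a plausible utility from the posterior.
        return random.betavariate(self.successes + 1, self.failures + 1)

    def update(self, reward):
        # Treat the composite reward in [0, 1] as a soft success count.
        self.successes += reward
        self.failures += 1.0 - reward

def composite_reward(passed, llm_calls, max_calls=50):
    """Hypothetical reward: task success dominates, efficiency breaks ties."""
    success = 1.0 if passed else 0.0
    efficiency = max(0.0, 1.0 - llm_calls / max_calls)
    return 0.8 * success + 0.2 * efficiency

def run_hierarchy(hierarchy, task):
    """Placeholder for actually executing localize -> edit -> validate with
    the selected sub-agents; returns (passed_tests, llm_calls)."""
    return random.random() < 0.3, random.randint(5, 50)  # stub

# Candidate pool: several variants per role (illustrative names only).
pool = {
    "localizer": [Arm("loc-stacktrace"), Arm("loc-grep")],
    "editor":    [Arm("edit-diff"), Arm("edit-rewrite")],
    "validator": [Arm("val-unittests"), Arm("val-regression")],
}

BUDGET = 2000  # fixed total number of task evaluations
for step in range(BUDGET):
    # Pull: pick the highest posterior sample per role to form a hierarchy.
    hierarchy = {role: max(arms, key=lambda a: a.sample())
                 for role, arms in pool.items()}
    passed, calls = run_hierarchy(hierarchy, task=step)
    r = composite_reward(passed, calls)
    # Credit assignment (simplified): every selected arm shares the reward.
    for arm in hierarchy.values():
        arm.update(r)
```

The equal reward split across selected arms at the end of the loop is only a stand‑in; BOAD's credit‑assignment mechanism instead rewards each sub‑agent according to its measured helpfulness within the team.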
Results & Findings
| Benchmark | BOAD (36B) | vs. Single‑Agent 36B | vs. Manually‑Designed Multi‑Agent | vs. GPT‑4 | vs. Claude |
|---|---|---|---|---|---|
| SWE‑bench‑Verified | – | +12.4% pass rate | +6.8% pass rate | – | – |
| SWE‑bench‑Live (out‑of‑distribution) | 2nd place on leaderboard (≈ 1.8% behind top) | – | – | BOAD higher despite GPT‑4’s larger model size | BOAD higher |
- Generalization: BOAD’s hierarchies maintain higher success rates on newer, unseen issues, indicating better robustness to distribution shift.
- Efficiency: The bandit discovers useful sub‑agents using ~30% fewer LLM calls than exhaustive grid search.
- Ablation: Removing the credit‑assignment reward or limiting the hierarchy depth drops performance back to single‑agent levels, confirming the importance of both components.
Practical Implications
- Developer Tooling: IDE plugins could embed a BOAD‑trained orchestrator to automatically decompose bug reports, suggest targeted edits, and run verification, reducing the cognitive load on engineers.
- CI/CD Automation: Teams can plug BOAD‑derived agents into continuous integration pipelines to automatically triage failing tests, generate patches, and validate them before human review.
- Cost‑Effective AI Ops: Because BOAD learns to use specialized sub‑agents, the overall number of expensive LLM calls per issue drops, translating to lower API bills for enterprises.
- Extensibility: New sub‑agents (e.g., security scanner, performance profiler) can be added to the candidate pool; the bandit will automatically assess their utility without hand‑tuning.
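The extensibility point can be illustrated against the hypothetical pool from the methodology sketch: registering a new role is just adding more arms, and a no‑op arm lets the bandit learn to omit the specialist when it does not help. The role and arm names here are assumptions for illustration only.

```python
# Hypothetical extension of the earlier sketch: add a security-scanning role.
# The "skip" arm is a no-op, so the bandit can learn to leave the role out.
pool["security_scanner"] = [
    Arm("sec-static-scan"),   # e.g., wraps a static-analysis tool
    Arm("sec-llm-audit"),     # e.g., prompts the LLM to review the diff
    Arm("skip-security"),     # no-op: hierarchy runs without this role
]
```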
Limitations & Future Work
- Candidate Dependence: BOAD can only discover hierarchies from the predefined pool of sub‑agents; truly novel capabilities require manual prompt engineering.
- Scalability of Hierarchy Depth: The current implementation caps hierarchy depth to keep the bandit tractable; deeper, more complex pipelines may need hierarchical bandits or reinforcement learning.
- Reward Noise: Success metrics (e.g., test pass) can be noisy for ambiguous tasks, potentially misleading the bandit; richer reward signals (e.g., code quality metrics) are a promising direction.
- Human‑in‑the‑Loop Evaluation: Real‑world adoption will need studies on how developers interact with automatically generated hierarchies and trust the suggested fixes.
BOAD showcases a compelling path toward modular, self‑optimizing AI assistants for software engineering, bridging the gap between powerful LLMs and the structured, collaborative workflows that human developers rely on.
Authors
- Iris Xu
- Guangtao Zeng
- Zexue He
- Charles Jin
- Aldo Pareja
- Dan Gutfreund
- Chuang Gan
- Zhang‑Wei Hong
Paper Information
- arXiv ID: 2512.23631v1
- Categories: cs.LG, cs.AI
- Published: December 29, 2025