[Paper] BOAD: Discovering Hierarchical Software Engineering Agents via Bandit Optimization

Published: December 29, 2025 at 12:41 PM EST
4 min read

Source: arXiv - 2512.23631v1

Overview

The paper introduces BOAD (Bandit Optimization for Agent Design), a framework that automatically builds hierarchical, multi‑agent software‑engineering assistants. By treating each possible sub‑agent (e.g., bug localizer, code editor, test validator) as an arm in a multi‑armed bandit, BOAD learns how to compose a team of specialists that outperforms monolithic LLM agents on real‑world, long‑horizon coding tasks.

Key Contributions

  • Hierarchical Agent Discovery: Formulates the problem of finding effective sub‑agent hierarchies as a multi‑armed bandit (MAB) optimization, enabling scalable exploration of combinatorial design spaces.
  • Credit Assignment Mechanism: Introduces a reward signal that measures each sub‑agent’s “helpfulness” when collaborating, addressing the classic credit‑assignment challenge in multi‑agent teams.
  • BOAD Framework: Provides an end‑to‑end pipeline (candidate generation → bandit‑driven selection → joint evaluation) that works under strict evaluation budgets.
  • Empirical Gains: Demonstrates state‑of‑the‑art performance on SWE‑bench‑Verified and SWE‑bench‑Live, beating strong baselines including single‑agent LLMs and manually crafted multi‑agent systems.
  • Open‑Source Release: Shares code, prompts, and evaluation scripts, facilitating reproducibility and community extensions.

Methodology

  1. Candidate Sub‑Agent Pool:

    • Start with a set of prompt templates and tool‑wrappers (e.g., “find the buggy file”, “apply a diff”, “run unit tests”).
    • Each template + LLM pair constitutes a candidate arm.
  2. Bandit Formulation:

    • Arms: Individual candidate sub‑agents.
    • Pull: Assemble a hierarchy (orchestrator + selected sub‑agents) and run it on a SWE task.
    • Reward: A composite score combining task success (e.g., passing verification tests) and efficiency (e.g., number of LLM calls).
  3. Exploration‑Exploitation Loop:

    • Use a contextual MAB algorithm (e.g., Thompson Sampling) to balance trying new sub‑agents vs. exploiting known good ones.
    • After each evaluation, update posterior beliefs about each arm’s utility, which directly influences future hierarchy proposals.
  4. Orchestrator Design:

    • A lightweight controller decides the execution order (localize → edit → validate) based on the current hierarchy.
    • The orchestrator itself can be a simple rule‑based script; the novelty lies in the automatically discovered sub‑agents it coordinates.
  5. Evaluation Budget:

    • The bandit runs under a fixed number of total task evaluations (e.g., a few thousand), reflecting realistic constraints on LLM API usage; a minimal sketch of the full selection loop follows this list.
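
To make steps 2–5 concrete, here is a minimal Python sketch of a bandit‑driven selection loop under a fixed evaluation budget. The candidate sub‑agent names, the fixed localize → edit → validate roles, the Beta–Bernoulli Thompson Sampling posterior, the linear cost penalty in the composite reward, and the `run_hierarchy` stub are all illustrative assumptions, not the paper’s exact design.

```python
import random

# Hypothetical candidate sub-agents (prompt template + tool wrapper pairs).
# The role names and the fixed localize -> edit -> validate split are
# assumptions for illustration; BOAD's actual candidate pool is generated separately.
CANDIDATES = {
    "localizer": ["stack-trace localizer", "embedding-search localizer"],
    "editor": ["single-file diff editor", "multi-file patch editor"],
    "validator": ["unit-test runner", "regression-test runner"],
}

def run_hierarchy(hierarchy, task):
    """Stub for assembling an orchestrator around the chosen sub-agents and
    running it on one SWE task. A real implementation would call an LLM for
    each step; here the outcome is faked so the sketch stays runnable."""
    success = random.random() < 0.5      # did the patch pass verification?
    llm_calls = random.randint(5, 30)    # rough proxy for cost
    return success, llm_calls

def composite_reward(success, llm_calls, cost_weight=0.01):
    """Assumed form of the composite reward: task success minus a small
    penalty per LLM call. The exact weighting is a placeholder."""
    return float(success) - cost_weight * llm_calls

# Beta-Bernoulli Thompson Sampling state: one [alpha, beta] pair per arm.
posterior = {(role, arm): [1.0, 1.0]
             for role, arms in CANDIDATES.items() for arm in arms}

BUDGET = 2000  # fixed number of total task evaluations (step 5)

for step in range(BUDGET):
    # Sample a utility estimate for every arm and keep the best one per role.
    hierarchy = {}
    for role, arms in CANDIDATES.items():
        samples = {arm: random.betavariate(*posterior[(role, arm)]) for arm in arms}
        hierarchy[role] = max(samples, key=samples.get)

    # Pull: run the assembled hierarchy (localize -> edit -> validate) on a task.
    success, llm_calls = run_hierarchy(hierarchy, task=f"swe-task-{step}")
    reward = composite_reward(success, llm_calls)

    # Update posterior beliefs for the arms that participated. Thresholding
    # the reward into a win/loss keeps the Beta update simple; it stands in
    # for the paper's per-sub-agent credit assignment.
    win = reward > 0
    for role, arm in hierarchy.items():
        posterior[(role, arm)][0 if win else 1] += 1

best = {role: max(arms, key=lambda a, r=role: posterior[(r, a)][0] / sum(posterior[(r, a)]))
        for role, arms in CANDIDATES.items()}
print("Discovered hierarchy:", best)
```

In BOAD itself, the update would additionally use the per‑sub‑agent “helpfulness” signal for credit assignment; the simple win/loss threshold above is only a stand‑in.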

Results & Findings

  • SWE‑bench‑Verified: BOAD (36B) achieves a +12.4% pass rate over the single‑agent 36B baseline and a +6.8% improvement over the manually designed multi‑agent baseline.
  • SWE‑bench‑Live (out‑of‑distribution): BOAD places 2nd on the leaderboard (≈ 1.8% behind the top entry), while the GPT‑4‑ and Claude‑based baselines score below BOAD despite their larger model sizes.
  • Generalization: BOAD’s hierarchies maintain higher success rates on newer, unseen issues, indicating better robustness to distribution shift.
  • Efficiency: The bandit discovers useful sub‑agents using ~30% fewer LLM calls than exhaustive grid search.
  • Ablation: Removing the credit‑assignment reward or limiting the hierarchy depth drops performance back to single‑agent levels, confirming the importance of both components.

Practical Implications

  • Developer Tooling: IDE plugins could embed a BOAD‑trained orchestrator to automatically decompose bug reports, suggest targeted edits, and run verification, reducing the cognitive load on engineers.
  • CI/CD Automation: Teams can plug BOAD‑derived agents into continuous integration pipelines to automatically triage failing tests, generate patches, and validate them before human review.
  • Cost‑Effective AI Ops: Because BOAD learns to use specialized sub‑agents, the overall number of expensive LLM calls per issue drops, translating to lower API bills for enterprises.
  • Extensibility: New sub‑agents (e.g., security scanner, performance profiler) can be added to the candidate pool; the bandit will automatically assess their utility without hand‑tuning (see the snippet after this list).
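
As a hypothetical illustration of that extensibility, reusing the `CANDIDATES` and `posterior` names from the sketch in the Methodology section, adding a new sub‑agent is just a data change:

```python
# Hypothetical: register a new role and candidate sub-agent in the pool from
# the earlier sketch. The bandit starts it at an uninformative Beta(1, 1)
# prior and learns its utility from subsequent evaluations.
CANDIDATES["security"] = ["static-analysis security scanner"]
posterior[("security", "static-analysis security scanner")] = [1.0, 1.0]
```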

Limitations & Future Work

  • Candidate Dependence: BOAD can only discover hierarchies from the predefined pool of sub‑agents; truly novel capabilities require manual prompt engineering.
  • Scalability of Hierarchy Depth: The current implementation caps hierarchy depth to keep the bandit tractable; deeper, more complex pipelines may need hierarchical bandits or reinforcement learning.
  • Reward Noise: Success metrics (e.g., test pass) can be noisy for ambiguous tasks, potentially misleading the bandit; richer reward signals (e.g., code quality metrics) are a promising direction.
  • Human‑in‑the‑Loop Evaluation: Real‑world adoption will need studies on how developers interact with automatically generated hierarchies and trust the suggested fixes.

BOAD showcases a compelling path toward modular, self‑optimizing AI assistants for software engineering, bridging the gap between powerful LLMs and the structured, collaborative workflows that human developers rely on.

Authors

  • Iris Xu
  • Guangtao Zeng
  • Zexue He
  • Charles Jin
  • Aldo Pareja
  • Dan Gutfreund
  • Chuang Gan
  • Zhang‑Wei Hong

Paper Information

  • arXiv ID: 2512.23631v1
  • Categories: cs.LG, cs.AI
  • Published: December 29, 2025