[Paper] BOAD: Discovering Hierarchical Software Engineering Agents via Bandit Optimization

Published: December 29, 2025 at 12:41 PM EST
4 min read

Source: arXiv - 2512.23631v1

Overview

The paper introduces BOAD (Bandit Optimization for Agent Design), a framework that automatically builds hierarchical, multi‑agent software‑engineering assistants. By treating each possible sub‑agent (e.g., bug localizer, code editor, test validator) as an arm in a multi‑armed bandit, BOAD learns how to compose a team of specialists that outperforms monolithic LLM agents on real‑world, long‑horizon coding tasks.

Key Contributions

  • Hierarchical Agent Discovery: Formulates the problem of finding effective sub‑agent hierarchies as a multi‑armed bandit (MAB) optimization, enabling scalable exploration of combinatorial design spaces.
  • Credit Assignment Mechanism: Introduces a reward signal that measures each sub‑agent’s “helpfulness” when collaborating, addressing the classic credit‑assignment challenge in multi‑agent teams.
  • BOAD Framework: Provides an end‑to‑end pipeline (candidate generation → bandit‑driven selection → joint evaluation) that works under strict evaluation budgets.
  • Empirical Gains: Demonstrates state‑of‑the‑art performance on SWE‑bench‑Verified and SWE‑bench‑Live, beating strong baselines including single‑agent LLMs and manually crafted multi‑agent systems.
  • Open‑Source Release: Shares code, prompts, and evaluation scripts, facilitating reproducibility and community extensions.

Methodology

  1. Candidate Sub‑Agent Pool:

    • Start with a set of prompt templates and tool‑wrappers (e.g., “find the buggy file”, “apply a diff”, “run unit tests”).
    • Each template + LLM pair constitutes a candidate arm.
  2. Bandit Formulation:

    • Arms: Individual candidate sub‑agents.
    • Pull: Assemble a hierarchy (orchestrator + selected sub‑agents) and run it on a SWE task.
    • Reward: A composite score combining task success (e.g., passing verification tests) and efficiency (e.g., number of LLM calls).
  3. Exploration‑Exploitation Loop:

    • Use a contextual MAB algorithm (e.g., Thompson Sampling) to balance trying new sub‑agents vs. exploiting known good ones.
    • After each evaluation, update posterior beliefs about each arm’s utility, which directly influences future hierarchy proposals.
  4. Orchestrator Design:

    • A lightweight controller decides the execution order (localize → edit → validate) based on the current hierarchy.
    • The orchestrator itself can be a simple rule‑based script; the novelty lies in the automatically discovered sub‑agents it coordinates.
  5. Evaluation Budget:

    • The bandit runs under a fixed number of total task evaluations (e.g., a few thousand), reflecting realistic constraints on LLM API usage; a minimal sketch of the full selection loop follows this list.
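
To make steps 2–5 concrete, here is a minimal Python sketch of a bandit‑driven selection loop under a fixed evaluation budget. The candidate sub‑agent names, the fixed localize → edit → validate roles, the Beta–Bernoulli Thompson Sampling posterior, the linear cost penalty in the composite reward, and the `run_hierarchy` stub are all illustrative assumptions, not the paper’s exact design.

```python
import random

# Hypothetical candidate sub-agents (prompt template + tool wrapper pairs).
# The role names and the fixed localize -> edit -> validate split are
# assumptions for illustration; BOAD's actual candidate pool is generated separately.
CANDIDATES = {
    "localizer": ["stack-trace localizer", "embedding-search localizer"],
    "editor": ["single-file diff editor", "multi-file patch editor"],
    "validator": ["unit-test runner", "regression-test runner"],
}

def run_hierarchy(hierarchy, task):
    """Stub for assembling an orchestrator around the chosen sub-agents and
    running it on one SWE task. A real implementation would call an LLM for
    each step; here the outcome is faked so the sketch stays runnable."""
    success = random.random() < 0.5      # did the patch pass verification?
    llm_calls = random.randint(5, 30)    # rough proxy for cost
    return success, llm_calls

def composite_reward(success, llm_calls, cost_weight=0.01):
    """Assumed form of the composite reward: task success minus a small
    penalty per LLM call. The exact weighting is a placeholder."""
    return float(success) - cost_weight * llm_calls

# Beta-Bernoulli Thompson Sampling state: one [alpha, beta] pair per arm.
posterior = {(role, arm): [1.0, 1.0]
             for role, arms in CANDIDATES.items() for arm in arms}

BUDGET = 2000  # fixed number of total task evaluations (step 5)

for step in range(BUDGET):
    # Sample a utility estimate for every arm and keep the best one per role.
    hierarchy = {}
    for role, arms in CANDIDATES.items():
        samples = {arm: random.betavariate(*posterior[(role, arm)]) for arm in arms}
        hierarchy[role] = max(samples, key=samples.get)

    # Pull: run the assembled hierarchy (localize -> edit -> validate) on a task.
    success, llm_calls = run_hierarchy(hierarchy, task=f"swe-task-{step}")
    reward = composite_reward(success, llm_calls)

    # Update posterior beliefs for the arms that participated. Thresholding
    # the reward into a win/loss keeps the Beta update simple; it stands in
    # for the paper's per-sub-agent credit assignment.
    win = reward > 0
    for role, arm in hierarchy.items():
        posterior[(role, arm)][0 if win else 1] += 1

best = {role: max(arms, key=lambda a, r=role: posterior[(r, a)][0] / sum(posterior[(r, a)]))
        for role, arms in CANDIDATES.items()}
print("Discovered hierarchy:", best)
```

In BOAD itself, the update would additionally use the per‑sub‑agent “helpfulness” signal for credit assignment; the simple win/loss threshold above is only a stand‑in.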

Results & Findings

  • SWE‑bench‑Verified: BOAD (36B) achieves a +12.4% pass rate over the single‑agent 36B baseline and a +6.8% improvement over the manually designed multi‑agent baseline.
  • SWE‑bench‑Live (out‑of‑distribution): BOAD places 2nd on the leaderboard (≈ 1.8% behind the top entry), while the GPT‑4‑ and Claude‑based baselines score below BOAD despite their larger model sizes.
  • Generalization: BOAD’s hierarchies maintain higher success rates on newer, unseen issues, indicating better robustness to distribution shift.
  • Efficiency: The bandit discovers useful sub‑agents using ~30% fewer LLM calls than exhaustive grid search.
  • Ablation: Removing the credit‑assignment reward or limiting the hierarchy depth drops performance back to single‑agent levels, confirming the importance of both components.

Practical Implications

  • Developer Tooling: IDE plugins could embed a BOAD‑trained orchestrator to automatically decompose bug reports, suggest targeted edits, and run verification, reducing the cognitive load on engineers.
  • CI/CD Automation: Teams can plug BOAD‑derived agents into continuous integration pipelines to automatically triage failing tests, generate patches, and validate them before human review.
  • Cost‑Effective AI Ops: Because BOAD learns to use specialized sub‑agents, the overall number of expensive LLM calls per issue drops, translating to lower API bills for enterprises.
  • Extensibility: New sub‑agents (e.g., security scanner, performance profiler) can be added to the candidate pool; the bandit will automatically assess their utility without hand‑tuning (see the snippet after this list).
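
As a hypothetical illustration of that extensibility, reusing the `CANDIDATES` and `posterior` names from the sketch in the Methodology section, adding a new sub‑agent is just a data change:

```python
# Hypothetical: register a new role and candidate sub-agent in the pool from
# the earlier sketch. The bandit starts it at an uninformative Beta(1, 1)
# prior and learns its utility from subsequent evaluations.
CANDIDATES["security"] = ["static-analysis security scanner"]
posterior[("security", "static-analysis security scanner")] = [1.0, 1.0]
```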

Limitations & Future Work

  • Candidate Dependence: BOAD can only discover hierarchies from the predefined pool of sub‑agents; truly novel capabilities require manual prompt engineering.
  • Scalability of Hierarchy Depth: The current implementation caps hierarchy depth to keep the bandit tractable; deeper, more complex pipelines may need hierarchical bandits or reinforcement learning.
  • Reward Noise: Success metrics (e.g., test pass) can be noisy for ambiguous tasks, potentially misleading the bandit; richer reward signals (e.g., code quality metrics) are a promising direction.
  • Human‑in‑the‑Loop Evaluation: Real‑world adoption will need studies on how developers interact with automatically generated hierarchies and trust the suggested fixes.

BOAD showcases a compelling path toward modular, self‑optimizing AI assistants for software engineering, bridging the gap between powerful LLMs and the structured, collaborative workflows that human developers rely on.

Authors

  • Iris Xu
  • Guangtao Zeng
  • Zexue He
  • Charles Jin
  • Aldo Pareja
  • Dan Gutfreund
  • Chuang Gan
  • Zhang‑Wei Hong

Paper Information

  • arXiv ID: 2512.23631v1
  • Categories: cs.LG, cs.AI
  • Published: December 29, 2025