[Paper] AOrchestra: Automating Sub-Agent Creation for Agentic Orchestration
Source: arXiv - 2602.03786v1
Overview
The paper introduces AOrchestra, a new framework that lets a central “orchestrator” automatically spin up specialized sub‑agents on the fly to tackle complex, multi‑step tasks. By representing any agent as a simple Instruction‑Context‑Tools‑Model tuple, AOrchestra can dynamically assemble the right mix of knowledge, tools, and language model at each step, dramatically reducing the hand‑crafted engineering that current multi‑agent systems require.
Key Contributions
- Unified agent abstraction: A generic tuple (Instruction, Context, Tools, Model) that captures the essence of any language‑based agent, independent of the underlying LLM or toolset.
- Dynamic sub‑agent creation: The orchestrator can instantiate task‑specific agents at runtime, selecting the most appropriate tools and models without human intervention.
- Framework‑agnostic design: Plug‑and‑play support for heterogeneous agents (e.g., code‑generation bots, web‑search assistants, terminal controllers) across different AI stacks.
- Performance‑cost trade‑off control: AOrchestra can balance accuracy and latency by choosing cheaper or more powerful models per sub‑task, moving toward Pareto‑efficient operation.
- Empirical gains: On three demanding benchmarks (GAIA, SWE‑Bench, Terminal‑Bench) the system improves task success rates by 14–17% relative over the strongest baselines when using Gemini‑3‑Flash.
Methodology
Agent Tuple Definition
- Instruction – the high‑level goal or prompt for the agent.
- Context – any relevant data (code snippets, prior conversation, file system state).
- Tools – external utilities the agent may call (search APIs, code compilers, shell commands).
- Model – the LLM that will generate the agent’s next action.
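The tuple above can be rendered as a small data structure. The following is a minimal Python sketch; the four field names follow the paper's abstraction, but the class, method, and default values are illustrative assumptions, not the paper's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical rendering of the (Instruction, Context, Tools, Model) tuple.
@dataclass
class Agent:
    instruction: str                                     # high-level goal or prompt
    context: list[str] = field(default_factory=list)     # snippets, logs, file state
    tools: dict[str, Callable[[str], str]] = field(default_factory=dict)
    model: str = "small-llm"                             # identifier of the backing LLM

    def prompt(self) -> str:
        """Assemble the text the chosen model would receive."""
        tool_list = ", ".join(self.tools) or "none"
        ctx = "\n".join(self.context)
        return f"{self.instruction}\nTools: {tool_list}\nContext:\n{ctx}"

agent = Agent("Fix the failing unit test", context=["test_log.txt: AssertionError"])
print(agent.prompt())
```

Because the tuple is independent of any particular LLM or toolset, swapping `model` or `tools` is enough to repurpose the same agent shell for a different sub-task.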
Orchestrator Loop
- Context Curation: Pulls the most relevant pieces of information for the current step (e.g., recent logs, retrieved docs).
- Tool & Model Selection: Uses a lightweight policy model to decide which tools and which LLM size to employ, based on estimated difficulty and cost.
- Sub‑Agent Instantiation: Constructs a concrete agent from the tuple and runs it to produce an action (e.g., “run this command”, “write this function”).
- Feedback Integration: The orchestrator observes the outcome, updates its internal state, and repeats until the overall task is solved or a termination condition is met.
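The four-step loop can be sketched as follows. All helper names (`curate_context`, `select`, `run_sub_agent`) and the toy escalation policy are assumptions for illustration, not the paper's implementation.

```python
def curate_context(state):
    # 1. Context curation: keep only the most recent observations.
    return state["history"][-3:]

def select(state, context):
    # 2. Tool & model selection: toy policy that escalates to a larger
    #    model once several steps have already been spent.
    model = "large-llm" if len(state["history"]) > 2 else "small-llm"
    return ["shell"], model

def run_sub_agent(task, context, tools, model):
    # 3. Sub-agent instantiation & run: stand-in for an LLM call that
    #    declares success once enough feedback has accumulated.
    done = len(context) >= 2
    return {"model": model, "done": done}

def run_orchestrator(task, max_iters=10):
    state = {"task": task, "history": []}
    for _ in range(max_iters):
        context = curate_context(state)
        tools, model = select(state, context)
        outcome = run_sub_agent(task, context, tools, model)
        state["history"].append(outcome)        # 4. Feedback integration
        if outcome["done"]:
            break
    return state

result = run_orchestrator("repair the build")
```

The real system would replace `run_sub_agent` with an actual model call and `select` with the learned policy described under Training & Evaluation.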
Plug‑and‑Play Integration
- Developers can register new tool wrappers or LLM back‑ends via a simple interface; the orchestrator automatically discovers and incorporates them.
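A registration interface of this kind could look like the decorator-based sketch below; the registry dict and decorator are assumptions, not the paper's actual interface.

```python
# Hypothetical plug-and-play registry: tools announce themselves by name,
# and the orchestrator discovers them at sub-agent construction time.
TOOL_REGISTRY = {}

def register_tool(name):
    def wrap(fn):
        TOOL_REGISTRY[name] = fn    # now discoverable by the orchestrator
        return fn
    return wrap

@register_tool("word_count")
def word_count(text: str) -> int:
    return len(text.split())

# Lookup as the orchestrator might do it when building an agent tuple:
tool = TOOL_REGISTRY["word_count"]
```

An LLM back-end could be registered through the same pattern, which is what makes the design framework-agnostic.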
Training & Evaluation
- The selection policy is trained on a mixture of synthetic and real task traces, optimizing for success rate while penalizing expensive model calls.
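The "success rate minus cost penalty" trade-off can be illustrated with a toy scoring rule: choose the model that maximizes estimated success probability minus a price penalty. The function, the λ weight, and all numbers below are made up for illustration.

```python
# Toy cost-penalized selection: score(m) = est_success(m) - lam * cost(m).
def pick_model(candidates, est_success, lam=0.5):
    # candidates: {model_name: relative cost}
    # est_success: {model_name: estimated success probability}
    return max(candidates, key=lambda m: est_success[m] - lam * candidates[m])

models = {"small": 0.1, "large": 1.0}
probs = {"small": 0.6, "large": 0.9}

# With a heavy penalty the cheap model wins (0.55 vs. 0.40); lowering
# lam shifts the choice toward the stronger, pricier model.
choice = pick_model(models, probs)
```

Sweeping `lam` traces out the performance-cost frontier that the paper describes as moving toward Pareto-efficient operation.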
Results & Findings
| Benchmark | Baseline (Gemini‑3‑Flash) | AOrchestra (Gemini‑3‑Flash) | Relative Gain |
|---|---|---|---|
| GAIA | 68.4% | 78.5% | +14.8% |
| SWE‑Bench | 55.2% | 63.1% | +14.3% |
| Terminal‑Bench | 61.7% | 71.9% | +16.6% |
- Higher success rates stem from the orchestrator’s ability to pick a lightweight model for simple sub‑tasks and reserve the heavyweight Gemini‑3‑Flash for the hardest steps.
- Cost efficiency: Average token usage per task dropped by ~12 % compared to a static‑model baseline, confirming the controllable performance‑cost trade‑off.
- Robustness: The system handled previously unseen tool combinations without additional fine‑tuning, showcasing the power of the abstraction.
Practical Implications
- Reduced engineering overhead: Teams no longer need to hand‑craft a hierarchy of specialized bots for each workflow; AOrchestra auto‑generates them as needed.
- Scalable AI assistants: Enterprises can deploy a single orchestrator that dynamically adapts to new internal tools (e.g., CI pipelines, proprietary APIs) without rewriting prompts.
- Cost‑aware AI services: Cloud providers can expose a “smart orchestration” endpoint that automatically balances latency and price, offering developers predictable billing.
- Rapid prototyping: Developers can experiment with novel tool‑LLM pairings by simply registering a new tool wrapper, letting the orchestrator discover the best usage pattern.
Limitations & Future Work
- Selection policy opacity: The current policy model is a black‑box neural network, making it hard to audit why a particular tool or model was chosen.
- Tool reliability assumptions: AOrchestra assumes registered tools behave deterministically; flaky external services can degrade performance.
- Benchmark scope: While GAIA, SWE‑Bench, and Terminal‑Bench are diverse, they still represent a limited slice of real‑world enterprise workflows.
- Future directions: The authors plan to (1) incorporate explainable decision‑making for tool/model selection, (2) explore reinforcement‑learning‑based adaptation to live feedback, and (3) extend the framework to multi‑orchestrator collaborations for truly distributed AI pipelines.
Authors
- Jianhao Ruan
- Zhihao Xu
- Yiran Peng
- Fashen Ren
- Zhaoyang Yu
- Xinbing Liang
- Jinyu Xiang
- Bang Liu
- Chenglin Wu
- Yuyu Luo
- Jiayi Zhang
Paper Information
- arXiv ID: 2602.03786v1
- Categories: cs.AI, cs.CL
- Published: February 3, 2026