[Paper] AOrchestra: Automating Sub-Agent Creation for Agentic Orchestration
Source: arXiv - 2602.03786v1
Overview
The paper introduces AOrchestra, a new framework that lets a central “orchestrator” automatically spin up specialized sub‑agents on the fly to tackle complex, multi‑step tasks. By representing any agent as a simple Instruction‑Context‑Tools‑Model tuple, AOrchestra can dynamically assemble the right mix of knowledge, tools, and language model at each step, dramatically reducing the hand‑crafted engineering that current multi‑agent systems require.
Key Contributions
- Unified agent abstraction: A generic tuple (Instruction, Context, Tools, Model) that captures the essence of any language‑based agent, independent of the underlying LLM or toolset.
- Dynamic sub‑agent creation: The orchestrator can instantiate task‑specific agents at runtime, selecting the most appropriate tools and models without human intervention.
- Framework‑agnostic design: Plug‑and‑play support for heterogeneous agents (e.g., code‑generation bots, web‑search assistants, terminal controllers) across different AI stacks.
- Performance‑cost trade‑off control: AOrchestra can balance accuracy and latency by choosing cheaper or more powerful models per sub‑task, moving toward Pareto‑efficient operation.
- Empirical gains: On three demanding benchmarks (GAIA, SWE‑Bench, Terminal‑Bench) the system improves task success rates by 14–17% relative over the strongest baselines when using Gemini‑3‑Flash.
Methodology
Agent Tuple Definition
- Instruction – the high‑level goal or prompt for the agent.
- Context – any relevant data (code snippets, prior conversation, file system state).
- Tools – external utilities the agent may call (search APIs, code compilers, shell commands).
- Model – the LLM that will generate the agent’s next action.
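The tuple above can be rendered as a small data structure. The following is a minimal Python sketch; the four field names follow the paper's abstraction, but the class, method, and default values are illustrative assumptions, not the paper's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical rendering of the (Instruction, Context, Tools, Model) tuple.
@dataclass
class Agent:
    instruction: str                                     # high-level goal or prompt
    context: list[str] = field(default_factory=list)     # snippets, logs, file state
    tools: dict[str, Callable[[str], str]] = field(default_factory=dict)
    model: str = "small-llm"                             # identifier of the backing LLM

    def prompt(self) -> str:
        """Assemble the text the chosen model would receive."""
        tool_list = ", ".join(self.tools) or "none"
        ctx = "\n".join(self.context)
        return f"{self.instruction}\nTools: {tool_list}\nContext:\n{ctx}"

agent = Agent("Fix the failing unit test", context=["test_log.txt: AssertionError"])
print(agent.prompt())
```

Because the tuple is independent of any particular LLM or toolset, swapping `model` or `tools` is enough to repurpose the same agent shell for a different sub-task.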
Orchestrator Loop
- Context Curation: Pulls the most relevant pieces of information for the current step (e.g., recent logs, retrieved docs).
- Tool & Model Selection: Uses a lightweight policy model to decide which tools and which LLM size to employ, based on estimated difficulty and cost.
- Sub‑Agent Instantiation: Constructs a concrete agent from the tuple and runs it to produce an action (e.g., “run this command”, “write this function”).
- Feedback Integration: The orchestrator observes the outcome, updates its internal state, and repeats until the overall task is solved or a termination condition is met.
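The four-step loop can be sketched as follows. All helper names (`curate_context`, `select`, `run_sub_agent`) and the toy escalation policy are assumptions for illustration, not the paper's implementation.

```python
def curate_context(state):
    # 1. Context curation: keep only the most recent observations.
    return state["history"][-3:]

def select(state, context):
    # 2. Tool & model selection: toy policy that escalates to a larger
    #    model once several steps have already been spent.
    model = "large-llm" if len(state["history"]) > 2 else "small-llm"
    return ["shell"], model

def run_sub_agent(task, context, tools, model):
    # 3. Sub-agent instantiation & run: stand-in for an LLM call that
    #    declares success once enough feedback has accumulated.
    done = len(context) >= 2
    return {"model": model, "done": done}

def run_orchestrator(task, max_iters=10):
    state = {"task": task, "history": []}
    for _ in range(max_iters):
        context = curate_context(state)
        tools, model = select(state, context)
        outcome = run_sub_agent(task, context, tools, model)
        state["history"].append(outcome)        # 4. Feedback integration
        if outcome["done"]:
            break
    return state

result = run_orchestrator("repair the build")
```

The real system would replace `run_sub_agent` with an actual model call and `select` with the learned policy described under Training & Evaluation.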
Plug‑and‑Play Integration
- Developers can register new tool wrappers or LLM back‑ends via a simple interface; the orchestrator automatically discovers and incorporates them.
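A registration interface of this kind could look like the decorator-based sketch below; the registry dict and decorator are assumptions, not the paper's actual interface.

```python
# Hypothetical plug-and-play registry: tools announce themselves by name,
# and the orchestrator discovers them at sub-agent construction time.
TOOL_REGISTRY = {}

def register_tool(name):
    def wrap(fn):
        TOOL_REGISTRY[name] = fn    # now discoverable by the orchestrator
        return fn
    return wrap

@register_tool("word_count")
def word_count(text: str) -> int:
    return len(text.split())

# Lookup as the orchestrator might do it when building an agent tuple:
tool = TOOL_REGISTRY["word_count"]
```

An LLM back-end could be registered through the same pattern, which is what makes the design framework-agnostic.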
Training & Evaluation
- The selection policy is trained on a mixture of synthetic and real task traces, optimizing for success rate while penalizing expensive model calls.
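The "success rate minus cost penalty" trade-off can be illustrated with a toy scoring rule: choose the model that maximizes estimated success probability minus a price penalty. The function, the λ weight, and all numbers below are made up for illustration.

```python
# Toy cost-penalized selection: score(m) = est_success(m) - lam * cost(m).
def pick_model(candidates, est_success, lam=0.5):
    # candidates: {model_name: relative cost}
    # est_success: {model_name: estimated success probability}
    return max(candidates, key=lambda m: est_success[m] - lam * candidates[m])

models = {"small": 0.1, "large": 1.0}
probs = {"small": 0.6, "large": 0.9}

# With a heavy penalty the cheap model wins (0.55 vs. 0.40); lowering
# lam shifts the choice toward the stronger, pricier model.
choice = pick_model(models, probs)
```

Sweeping `lam` traces out the performance-cost frontier that the paper describes as moving toward Pareto-efficient operation.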
Results & Findings
| Benchmark | Baseline (Gemini‑3‑Flash) | AOrchestra (Gemini‑3‑Flash) | Relative Gain |
|---|---|---|---|
| GAIA | 68.4% | 78.5% | +14.8% |
| SWE‑Bench | 55.2% | 63.1% | +14.3% |
| Terminal‑Bench | 61.7% | 71.9% | +16.6% |
- Higher success rates stem from the orchestrator’s ability to pick a lightweight model for simple sub‑tasks and reserve the heavyweight Gemini‑3‑Flash for the hardest steps.
- Cost efficiency: Average token usage per task dropped by ~12 % compared to a static‑model baseline, confirming the controllable performance‑cost trade‑off.
- Robustness: The system handled previously unseen tool combinations without additional fine‑tuning, showcasing the power of the abstraction.
Practical Implications
- Reduced engineering overhead: Teams no longer need to hand‑craft a hierarchy of specialized bots for each workflow; AOrchestra auto‑generates them as needed.
- Scalable AI assistants: Enterprises can deploy a single orchestrator that dynamically adapts to new internal tools (e.g., CI pipelines, proprietary APIs) without rewriting prompts.
- Cost‑aware AI services: Cloud providers can expose a “smart orchestration” endpoint that automatically balances latency and price, offering developers predictable billing.
- Rapid prototyping: Developers can experiment with novel tool‑LLM pairings by simply registering a new tool wrapper, letting the orchestrator discover the best usage pattern.
Limitations & Future Work
- Selection policy opacity: The current policy model is a black‑box neural network, making it hard to audit why a particular tool or model was chosen.
- Tool reliability assumptions: AOrchestra assumes registered tools behave deterministically; flaky external services can degrade performance.
- Benchmark scope: While GAIA, SWE‑Bench, and Terminal‑Bench are diverse, they still represent a limited slice of real‑world enterprise workflows.
- Future directions: The authors plan to (1) incorporate explainable decision‑making for tool/model selection, (2) explore reinforcement‑learning‑based adaptation to live feedback, and (3) extend the framework to multi‑orchestrator collaborations for truly distributed AI pipelines.
Authors
- Jianhao Ruan
- Zhihao Xu
- Yiran Peng
- Fashen Ren
- Zhaoyang Yu
- Xinbing Liang
- Jinyu Xiang
- Bang Liu
- Chenglin Wu
- Yuyu Luo
- Jiayi Zhang
Paper Information
- arXiv ID: 2602.03786v1
- Categories: cs.AI, cs.CL
- Published: February 3, 2026