How Sakana trained a 7B model to orchestrate GPT-5, Claude Sonnet 4 and Gemini 2.5 Pro
Source: VentureBeat
The Problem with Hard‑Coded LangChain Pipelines
“Every LangChain pipeline your team hardcodes starts breaking the moment the query distribution shifts — and it always shifts.”
That bottleneck is exactly what Sakana AI set out to eliminate.
Introducing the RL Conductor
Sakana AI’s researchers have built a small language model trained via reinforcement learning (RL) to automatically orchestrate a diverse pool of worker LLMs.
- Dynamic analysis of each input
- Labor distribution across multiple workers
- Coordination among agents
The result: state‑of‑the‑art performance on hard reasoning and coding benchmarks, outperforming frontier models such as GPT‑5 and Claude Sonnet 4, as well as expensive human‑designed multi‑agent pipelines. All of this is achieved at a fraction of the cost and with fewer API calls than competing solutions.
The RL Conductor is the backbone of Fugu, Sakana AI’s commercial multi‑agent orchestration service.
The Limitations of Manual Agentic Frameworks
Large language models possess strong latent capabilities, but tapping those capabilities fully remains a major challenge. Current commercial AI products rely heavily on manually designed agentic workflows, which suffer from several fundamental issues:
-
Rigidity & Constrained Design
- Hard‑coded pipelines (e.g., LangChain, Mixture‑of‑Agents) work for narrow use‑cases but break in production when user demands become heterogeneous.
-
Quote from the Authors
“While using frameworks with hard‑coded pipelines like LangChain and Mixture‑of‑Agents can work well for specific use cases … In production, an inherent bottleneck arises when targeting domains with large user bases with very heterogeneous demands.”
— Yujin Tang, co‑author (VentureBeat interview)Tang adds: “Real‑world generalization in such heterogeneous applications inherently necessitates going beyond human‑hardcoded designs.”
-
No Single Model Is Optimal for All Tasks
- Different models excel in different domains (scientific reasoning, code generation, mathematical logic, high‑level planning, etc.).
- Manually predicting and hard‑coding the ideal model combination for every query is practically impossible.
An optimal agentic framework should:
- Analyze a problem automatically.
- Delegate subtasks to the most suitable expert in the pool.
Conducting an Orchestra of Agents
The RL Conductor was built to overcome the above limitations. As its name suggests, it conducts an orchestra of agents by:
- Dividing challenging problems into subtasks.
- Delegating each subtask to a targeted worker LLM.
- Designing communication topologies (who sees which prior outputs).
How It Works
- Natural‑language workflow generation: For each step, the Conductor emits a plain‑English instruction, assigns an agent, and creates an access list that specifies which previous subtasks and responses are included in that agent’s context.
- Flexible structures:
- Simple sequential chains
- Parallel tree structures
- Recursive loops (when needed)
All of this is learned via reinforcement learning rather than hand‑crafted rules:
| Training Signal | What It Optimizes |
|---|---|
| Task + worker pool | Correct answer & proper output format |
| Reward (binary/graded) | Maximizing task success |
Through trial‑and‑error, the Conductor discovers advanced orchestration strategies such as:
- Targeted prompt engineering
- Iterative refinement
- Meta‑prompt optimization
Thus, the model dynamically adjusts its strategy and leverages each worker’s strengths without any human‑coded routing logic.
Conductor in Action
Experimental Setup
- Base model: 7‑billion‑parameter Qwen2.5‑7B fine‑tuned with the RL Conductor framework.
- Worker pool (7 models):
- Closed‑source giants: Gemini 2.5 Pro, Claude‑Sonnet‑4, GPT‑5
- Open‑source models: DeepSeek‑R1‑Distill‑Qwen‑32B, Gemma3‑27B, Qwen3‑32B, plus one additional model.
The Conductor was tasked with designing agentic workflows of up to five steps.
Benchmarks & Results
| Benchmark | Score (Conductor) | Comparison |
|---|---|---|
| Overall average | 77.27 % | New state‑of‑the‑art |
| AIME25 (math) | 93.3 % | Highest reported |
| GPQA‑Diamond | 87.5 % | — |
| LiveCodeBench | 83.93 % | — |
Efficiency
- Tokens per question:
- Baseline MoA: 11,203 tokens
- RL Conductor: 1,820 tokens (≈ 6× fewer)
- Average workflow steps: 3
Why It Works
-
Task‑difficulty awareness:
- Simple factual queries → single‑step or two‑agent workflows.
- Complex coding problems → up to four agents (planning, implementation, verification, etc.).
-
Model‑strength exploitation:
- The Conductor learns that frontier models have complementary strengths and routes subtasks accordingly (e.g., using Claude‑Sonnet‑4 for reasoning, Gemini 2.5 Pro for code synthesis).
Takeaways
- Hard‑coded pipelines are brittle in the face of shifting query distributions.
- RL Conductor demonstrates that a small, RL‑trained model can dynamically orchestrate a heterogeneous pool of LLMs, achieving superior accuracy and dramatically lower token usage.
- The approach paves the way for scalable, cost‑effective multi‑agent services like Fugu, moving beyond the limits of manual agentic designs.
Conductor‑Driven Benchmark Success
To achieve record scores on coding benchmarks, the Conductor frequently assigned Gemini 2.5 Pro and Claude Sonnet 4 to act as high‑level planners, bringing in GPT‑5 only at the very end to write the final optimized code.
In a particularly clever display of adaptability, the Conductor would sometimes abdicate its own role entirely, handing the entire planning process over to Gemini 2.5 Pro and allowing it to dictate the subtasks for the rest of the model pool.
Beyond Benchmarks – Enterprise Utility
“We have been using our Fugu models—built on Conductor technology—internally for a range of practical enterprise applications: software development, deep research, strategy development, and even visual tasks like slide generation,”
— Yujin Tang
Bringing Orchestration to the Enterprise: Sakana Fugu
- The 7B model described in the research paper was an exploratory blueprint and is not publicly available.
- Sakana AI has productized the Conductor framework into its flagship commercial AI product, Sakana Fugu.
Current Status
- Beta phase
- Serves as a multi‑agent orchestration system accessible through a standard OpenAI‑compatible API.
Target Market
“Fugu targets the large market of industries where AI adoption has yet to bring large productivity gains due to the generalization limitations of current hard‑coded pipelines, such as finance and defense.”
— Tang
Benefits for Enterprise Developers
- Seamless integration into existing applications without managing multiple API keys or manually routing tasks across different vendors.
- Behind the API, Fugu automates complex collaboration topologies and role assignments across a pool of models.
Product Variants
| Variant | Purpose | Key Characteristics |
|---|---|---|
| Fugu Mini | Low‑latency operations | Optimized for speed, suitable for real‑time use cases |
| Fugu Ultra | Maximum performance on demanding workloads | Scales to heavy computational loads, best for large‑scale tasks |
Governance & Interpretability
- Tang notes that interpretability risks are functionally similar to the hidden reasoning traces of current top‑tier closed APIs.
- The system is managed with established guardrails to minimize hallucinations.
When to Use RL‑Orchestration vs. Traditional Routing
“The absolute sweet spot comes whenever users and their teams feel they are spending a disproportionate amount of time guiding their underlying agents,”
— Tang
- Caution: The framework isn’t necessary for every scenario.
- Economic note: “It’s hard to beat the economic proposition of a local model running directly on the user’s machine for simple queries.”
Looking Ahead
- As the diversity of specialized open‑ and closed‑source AI models continues to grow, static hard‑coded pipelines will become obsolete.
- Dynamic orchestration is expected to extend beyond text and code.
“There is indeed a large potential to fill this gap with cross‑modal Conductor frameworks becoming the foundation for more autonomous, self‑coordinating physical AI systems.”
— Tang