The Invisible Orchestrator: Cheap Routing + Expensive Reasoning in Multi-Agent Apps

Published: April 17, 2026 at 03:05 PM EDT
5 min read
Source: Dev.to

Problem Overview

We built SamiWISE, a GMAT prep tutor that uses four specialist AI agents:

| Agent | Domain |
| --- | --- |
| quant | Quantitative reasoning (arithmetic, algebra, geometry, word problems, number properties) |
| verbal | Reading comprehension, critical reasoning, sentence correction |
| data_insights | Table analysis, multi-source reasoning, two-part analysis |
| strategy | Timing, test-taking approach, score targets, study-plan questions |

Each agent has its own system prompt, a dedicated Pinecone namespace, and a distinct reasoning style.
The challenge: route every user message to the correct specialist without adding noticeable latency.

Running every message through GPT‑4o to decide the route added 800–1,200 ms of delay before the first token appeared—unacceptable for a tutoring app where response feel matters.

Initial Routing Attempt

```typescript
// First attempt — routing via GPT-4o
const routingResponse = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [
    {
      role: "system",
      content: `You are a routing agent. Given a user message, return ONLY a JSON object:
{"agent": "quant" | "verbal" | "data_insights" | "strategy"}
No explanation. No other text.`,
    },
    { role: "user", content: userMessage },
  ],
  response_format: { type: "json_object" },
});

const { agent } = JSON.parse(routingResponse.choices[0].message.content!);
// then call the specialist...
```

Problems

  1. Latency – GPT‑4o needed 400–1,200 ms just to return a tiny JSON payload.
  2. Cost – Every user message incurred two LLM calls (router + specialist), increasing per‑message AI cost by ~35 %.
  3. Over‑engineering – The router only needs to output one of four tokens; frontier reasoning ability is unnecessary.

Switching to a Faster Model

We replaced the GPT‑4o router with Groq running llama-3.3-70b-versatile. The prompt and JSON output format stayed the same, but median routing latency dropped from ≈850 ms to ≈55 ms.

```typescript
// lib/openai/client.ts
import Groq from "groq-sdk";

export const groq = new Groq({
  apiKey: process.env.GROQ_API_KEY,
});
```

Routing Implementation

```typescript
// agents/gmat/orchestrator.ts — routing call
async function routeToAgent(
  userMessage: string,
  conversationContext: string
): Promise<string> {
  const response = await groq.chat.completions.create({
    model: "llama-3.3-70b-versatile",
    messages: [
      {
        role: "system",
        content: `Route the user message to one specialist agent.
Return ONLY valid JSON: {"agent": "quant" | "verbal" | "data_insights" | "strategy"}

Routing rules:
- quant: arithmetic, algebra, geometry, word problems, number properties
- verbal: reading comprehension, critical reasoning, sentence correction
- data_insights: table analysis, multi-source reasoning, two-part analysis
- strategy: timing, test-taking approach, score targets, study plan questions

Context (last 2 messages):
${conversationContext}`,
      },
      { role: "user", content: userMessage },
    ],
    response_format: { type: "json_object" },
    temperature: 0,      // deterministic routing
    max_tokens: 20,      // safeguard against extra text
  });

  const result = JSON.parse(response.choices[0].message.content!);

  // Validate result; fall back to "quant" if unexpected
  const valid = ["quant", "verbal", "data_insights", "strategy"] as const;
  return valid.includes(result.agent) ? result.agent : "quant";
}

The specialist agents still use GPT‑4o with full streaming. Because the routing call finishes in ~55 ms, the user never perceives a gap before the first streaming token arrives.
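As a rough illustration of how the specialist prompt might be assembled before that streaming call, here is a sketch. The function name, config shape, and message layout are hypothetical, not the actual SamiWISE implementation:

```typescript
// Hypothetical sketch: combine the specialist's system prompt with the RAG
// context retrieved from its Pinecone namespace, then append the user message.
interface SpecialistConfig {
  systemPrompt: string; // per-agent persona and reasoning style
}

interface ChatMessage {
  role: "system" | "user";
  content: string;
}

function buildSpecialistMessages(
  userMessage: string,
  config: SpecialistConfig,
  ragContext: string
): ChatMessage[] {
  return [
    {
      role: "system",
      // Ground the specialist's answer in retrieved course material.
      content: `${config.systemPrompt}\n\nReference material:\n${ragContext}`,
    },
    { role: "user", content: userMessage },
  ];
}
```

The resulting array is what a streaming `chat.completions.create` call to GPT-4o would receive as its `messages` parameter.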

Orchestration Flow

```typescript
// agents/gmat/orchestrator.ts — simplified main flow
export async function handleMessage(
  userMessage: string,
  userId: string,
  stream: ReadableStreamDefaultController
) {
  // 1. Build routing context from last 2 messages (~5 ms, local)
  const context = await getRecentContext(userId);

  // 2. Route via Groq — fast, cheap, deterministic (~55 ms)
  const agentType = await routeToAgent(userMessage, context);

  // 3. Load specialist config and RAG context in parallel
  const [agentConfig, ragContext] = await Promise.all([
    getAgentConfig(agentType),
    fetchRAGContext(userMessage, agentType), // hits the right Pinecone namespace
  ]);

  // 4. Stream response from GPT-4o specialist
  await streamSpecialistResponse(
    userMessage,
    agentConfig,
    ragContext,
    userId,
    stream
  );
}
```
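Step 1's context builder could look something like the sketch below. The helper name and message shape are assumptions for illustration; the key point is that only the last two turns are formatted, which keeps the router prompt tiny:

```typescript
// Hypothetical sketch of the routing-context formatter: take the stored
// conversation history and emit the short "last 2 messages" string that the
// router's system prompt interpolates.
interface StoredMessage {
  role: "user" | "assistant";
  content: string;
}

function formatRoutingContext(history: StoredMessage[]): string {
  return history
    .slice(-2) // only the two most recent turns
    .map((m) => `${m.role}: ${m.content}`)
    .join("\n");
}
```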

Step 3 fetches the specialist config and RAG context in parallel, so with routing done in ~55 ms, the end-to-end first-token latency from user submit to the first visible character is ≈900 ms.

Key Parameters

| Parameter | Reason |
| --- | --- |
| `temperature: 0` | Guarantees deterministic routing; higher values caused drift on ambiguous messages. |
| `max_tokens: 20` | Prevents the model from appending free-text after the JSON, ensuring parsable output. |
| `response_format: { type: "json_object" }` | Enforces strict JSON, simplifying downstream parsing. |

Results & Observations

  • Error rate – Groq’s llama model mis‑routed only 3 % of edge cases, compared with 8 % for GPT‑4o‑mini.
  • Speed – Median routing latency of 55 ms vs. 850 ms with GPT‑4o.
  • Cost – Routing with Groq is significantly cheaper per call, reducing overall per‑message AI cost.
  • Determinism – Temperature 0 proved essential; even a slight increase (0.2) introduced routing drift over time.

The routing/reasoning split is a pattern, not a hack. It can be reused for any scenario where a cheap, fast classification precedes an expensive generative response (e.g., intent detection, form‑field extraction).
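The split can be captured in a minimal sketch, with synchronous stand-ins for the two model calls and entirely hypothetical names:

```typescript
// Illustrative sketch of the cheap-classify / expensive-generate pattern.
// `classify` stands in for the fast router (Groq); `generators` stand in for
// the per-label expensive calls (GPT-4o). In production both stages would be
// async LLM calls.
function classifyThenGenerate<L extends string>(
  classify: (input: string) => L,
  generators: Record<L, (input: string) => string>,
  input: string
): { label: L; output: string } {
  const label = classify(input); // cheap, fast, deterministic
  return { label, output: generators[label](input) }; // expensive, runs once
}
```

The design point is that only one expensive branch ever runs per message, no matter how many specialists exist.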

What’s Next

  • Confidence scoring – Return a confidence metric and fall back to a clarifying question when uncertain.
  • Context‑aware routing – Weight recent topic more heavily in multi‑turn conversations.
  • Routing analytics – Track when users re‑ask or correct a mis‑routed response to improve the routing prompt over time.
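The confidence-scoring idea might work along these lines. This is a sketch of the planned behavior, not shipped code; the field names and threshold are assumptions:

```typescript
// Hypothetical confidence-gated routing: the router would return a confidence
// alongside the agent label, and low-confidence routes would trigger a
// clarifying question instead of a best guess.
type AgentType = "quant" | "verbal" | "data_insights" | "strategy";

interface RoutingDecision {
  agent: AgentType;
  confidence: number; // 0–1, as reported by the router model
}

function resolveRoute(
  decision: RoutingDecision,
  threshold = 0.6 // illustrative cutoff; would need tuning against real traffic
): { action: "route"; agent: AgentType } | { action: "clarify" } {
  return decision.confidence >= threshold
    ? { action: "route", agent: decision.agent }
    : { action: "clarify" };
}
```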

Discussion Questions

  • How do you handle routing in multi‑agent systems? Do you use a separate model or rely on the primary LLM via function calling?
  • Has anyone benchmarked other fast inference providers (Cerebras, Together, Fireworks) against Groq for structured routing tasks?
  • When routing confidence is low, do you ask the user to clarify or make a best guess and let them redirect if wrong?