The Invisible Orchestrator: Cheap Routing + Expensive Reasoning in Multi-Agent Apps

Published: April 17, 2026 at 03:05 PM EDT
5 min read
Source: Dev.to

Problem Overview

We built SamiWISE, a GMAT prep tutor that uses four specialist AI agents:

| Agent | Domain |
| --- | --- |
| quant | Quantitative reasoning (arithmetic, algebra, geometry, word problems, number properties) |
| verbal | Reading comprehension, critical reasoning, sentence correction |
| data_insights | Table analysis, multi-source reasoning, two-part analysis |
| strategy | Timing, test-taking approach, score targets, study-plan questions |

Each agent has its own system prompt, a dedicated Pinecone namespace, and a distinct reasoning style.
The challenge: route every user message to the correct specialist without adding noticeable latency.

Running every message through GPT‑4o to decide the route added 800–1,200 ms of delay before the first token appeared—unacceptable for a tutoring app where response feel matters.

Initial Routing Attempt

```typescript
// First attempt — routing via GPT-4o
const routingResponse = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [
    {
      role: "system",
      content: `You are a routing agent. Given a user message, return ONLY a JSON object:
{"agent": "quant" | "verbal" | "data_insights" | "strategy"}
No explanation. No other text.`,
    },
    { role: "user", content: userMessage },
  ],
  response_format: { type: "json_object" },
});

const { agent } = JSON.parse(routingResponse.choices[0].message.content!);
// then call the specialist...
```

Problems

  1. Latency – GPT‑4o needed 400–1,200 ms just to return a tiny JSON payload.
  2. Cost – Every user message incurred two LLM calls (router + specialist), increasing per‑message AI cost by ~35 %.
  3. Over‑engineering – The router only needs to output one of four tokens; frontier reasoning ability is unnecessary.

Switching to a Faster Model

We replaced the GPT‑4o router with Groq running llama-3.3-70b-versatile. The prompt and JSON output format stayed the same, but median routing latency dropped from ≈850 ms to ≈55 ms.

```typescript
// lib/openai/client.ts
import Groq from "groq-sdk";

export const groq = new Groq({
  apiKey: process.env.GROQ_API_KEY,
});
```

Routing Implementation

```typescript
// agents/gmat/orchestrator.ts — routing call
async function routeToAgent(
  userMessage: string,
  conversationContext: string
): Promise<string> {
  const response = await groq.chat.completions.create({
    model: "llama-3.3-70b-versatile",
    messages: [
      {
        role: "system",
        content: `Route the user message to one specialist agent.
Return ONLY valid JSON: {"agent": "quant" | "verbal" | "data_insights" | "strategy"}

Routing rules:
- quant: arithmetic, algebra, geometry, word problems, number properties
- verbal: reading comprehension, critical reasoning, sentence correction
- data_insights: table analysis, multi-source reasoning, two-part analysis
- strategy: timing, test-taking approach, score targets, study plan questions

Context (last 2 messages):
${conversationContext}`,
      },
      { role: "user", content: userMessage },
    ],
    response_format: { type: "json_object" },
    temperature: 0,      // deterministic routing
    max_tokens: 20,      // safeguard against extra text
  });

  const result = JSON.parse(response.choices[0].message.content!);

  // Validate result; fall back to "quant" if unexpected
  const valid = ["quant", "verbal", "data_insights", "strategy"] as const;
  return valid.includes(result.agent) ? result.agent : "quant";
}

The specialist agents still use GPT‑4o with full streaming. Because the routing call finishes in ~55 ms, the user never perceives a gap before the first streaming token arrives.
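As a rough illustration of how the specialist prompt might be assembled before that streaming call, here is a sketch. The function name, config shape, and message layout are hypothetical, not the actual SamiWISE implementation:

```typescript
// Hypothetical sketch: combine the specialist's system prompt with the RAG
// context retrieved from its Pinecone namespace, then append the user message.
interface SpecialistConfig {
  systemPrompt: string; // per-agent persona and reasoning style
}

interface ChatMessage {
  role: "system" | "user";
  content: string;
}

function buildSpecialistMessages(
  userMessage: string,
  config: SpecialistConfig,
  ragContext: string
): ChatMessage[] {
  return [
    {
      role: "system",
      // Ground the specialist's answer in retrieved course material.
      content: `${config.systemPrompt}\n\nReference material:\n${ragContext}`,
    },
    { role: "user", content: userMessage },
  ];
}
```

The resulting array is what a streaming `chat.completions.create` call to GPT-4o would receive as its `messages` parameter.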

Orchestration Flow

```typescript
// agents/gmat/orchestrator.ts — simplified main flow
export async function handleMessage(
  userMessage: string,
  userId: string,
  stream: ReadableStreamDefaultController
) {
  // 1. Build routing context from last 2 messages (~5 ms, local)
  const context = await getRecentContext(userId);

  // 2. Route via Groq — fast, cheap, deterministic (~55 ms)
  const agentType = await routeToAgent(userMessage, context);

  // 3. Load specialist config and RAG context in parallel
  const [agentConfig, ragContext] = await Promise.all([
    getAgentConfig(agentType),
    fetchRAGContext(userMessage, agentType), // hits the right Pinecone namespace
  ]);

  // 4. Stream response from GPT-4o specialist
  await streamSpecialistResponse(
    userMessage,
    agentConfig,
    ragContext,
    userId,
    stream
  );
}
```
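Step 1's context builder could look something like the sketch below. The helper name and message shape are assumptions for illustration; the key point is that only the last two turns are formatted, which keeps the router prompt tiny:

```typescript
// Hypothetical sketch of the routing-context formatter: take the stored
// conversation history and emit the short "last 2 messages" string that the
// router's system prompt interpolates.
interface StoredMessage {
  role: "user" | "assistant";
  content: string;
}

function formatRoutingContext(history: StoredMessage[]): string {
  return history
    .slice(-2) // only the two most recent turns
    .map((m) => `${m.role}: ${m.content}`)
    .join("\n");
}
```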

Step 3 fetches the specialist config and RAG context in parallel, so with routing done in ~55 ms, the end-to-end first-token latency from user submit to the first visible character is ≈900 ms.

Key Parameters

| Parameter | Reason |
| --- | --- |
| `temperature: 0` | Guarantees deterministic routing; higher values caused drift on ambiguous messages. |
| `max_tokens: 20` | Prevents the model from appending free-text after the JSON, ensuring parsable output. |
| `response_format: { type: "json_object" }` | Enforces strict JSON, simplifying downstream parsing. |

Results & Observations

  • Error rate – Groq’s llama model mis‑routed only 3 % of edge cases, compared with 8 % for GPT‑4o‑mini.
  • Speed – Median routing latency of 55 ms vs. 850 ms with GPT‑4o.
  • Cost – Routing with Groq is significantly cheaper per call, reducing overall per‑message AI cost.
  • Determinism – Temperature 0 proved essential; even a slight increase (0.2) introduced routing drift over time.

The routing/reasoning split is a pattern, not a hack. It can be reused for any scenario where a cheap, fast classification precedes an expensive generative response (e.g., intent detection, form‑field extraction).
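The split can be captured in a minimal sketch, with synchronous stand-ins for the two model calls and entirely hypothetical names:

```typescript
// Illustrative sketch of the cheap-classify / expensive-generate pattern.
// `classify` stands in for the fast router (Groq); `generators` stand in for
// the per-label expensive calls (GPT-4o). In production both stages would be
// async LLM calls.
function classifyThenGenerate<L extends string>(
  classify: (input: string) => L,
  generators: Record<L, (input: string) => string>,
  input: string
): { label: L; output: string } {
  const label = classify(input); // cheap, fast, deterministic
  return { label, output: generators[label](input) }; // expensive, runs once
}
```

The design point is that only one expensive branch ever runs per message, no matter how many specialists exist.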

What’s Next

  • Confidence scoring – Return a confidence metric and fall back to a clarifying question when uncertain.
  • Context‑aware routing – Weight recent topic more heavily in multi‑turn conversations.
  • Routing analytics – Track when users re‑ask or correct a mis‑routed response to improve the routing prompt over time.
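The confidence-scoring idea might work along these lines. This is a sketch of the planned behavior, not shipped code; the field names and threshold are assumptions:

```typescript
// Hypothetical confidence-gated routing: the router would return a confidence
// alongside the agent label, and low-confidence routes would trigger a
// clarifying question instead of a best guess.
type AgentType = "quant" | "verbal" | "data_insights" | "strategy";

interface RoutingDecision {
  agent: AgentType;
  confidence: number; // 0–1, as reported by the router model
}

function resolveRoute(
  decision: RoutingDecision,
  threshold = 0.6 // illustrative cutoff; would need tuning against real traffic
): { action: "route"; agent: AgentType } | { action: "clarify" } {
  return decision.confidence >= threshold
    ? { action: "route", agent: decision.agent }
    : { action: "clarify" };
}
```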

Discussion Questions

  • How do you handle routing in multi‑agent systems? Do you use a separate model or rely on the primary LLM via function calling?
  • Has anyone benchmarked other fast inference providers (Cerebras, Together, Fireworks) against Groq for structured routing tasks?
  • When routing confidence is low, do you ask the user to clarify or make a best guess and let them redirect if wrong?