Beyond Keywords: Engineering a Production-Ready Agentic Search Framework in Go

Published: December 29, 2025 at 05:00 AM EST
5 min read
Source: Dev.to

Amit Surana

Search systems have historically been optimized for retrieval: given a query, return the most relevant documents. That model breaks down the moment user intent shifts from finding information to solving problems.

Consider a query like:

“How will tomorrow’s weather in Seattle affect flight prices to JFK?”

This isn’t a search problem. It’s a reasoning problem — one that requires decomposition, orchestration across multiple systems, and synthesis into a coherent answer.

This is where agentic search comes in.

In this article, I’ll walk through how we designed and productionized an agentic search framework in Go — not as a demo, but as a real system operating under production constraints like latency, cost, concurrency, and failure modes.

Keyword and vector search systems excel at matching queries to documents. What they don’t handle well is:

  • Multi‑step reasoning
  • Tool coordination
  • Query decomposition
  • Answer synthesis

Agentic search treats the LLM not as a text generator, but as a planner — a component that decides what actions to take to answer a question.

At a high level, an agentic system must be able to:

  • Understand user intent
  • Decide which tools to call
  • Execute those tools safely
  • Iterate when necessary
  • Synthesize a final response

The hard part isn’t wiring an LLM to tools. The hard part is doing this predictably and economically in production.

High‑Level Architecture

We structured the system around three core concerns:

Concern    | Responsibility
Planning   | Deciding what to do
Execution  | Running tools efficiently
Synthesis  | Producing the final answer

Here’s the end‑to‑end flow:

User Query → Planner → Tool Registry → Tool Execution → Response Generator → SSE Stream → User

Each stage is deliberately isolated. Reasoning does not leak into execution, and execution does not influence planning decisions directly.
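
As a rough illustration, those stage boundaries can be expressed as small Go interfaces. The names below are illustrative, not the framework’s actual types; later snippets in this article reuse them.

// ToolCall is a single action the planner wants executed.
type ToolCall struct {
    Name  string
    Input map[string]any
}

// ToolResult is the structured output of one tool execution.
type ToolResult struct {
    Tool   string
    Output string
    Err    error
}

// Planner decides what to do; it never executes anything itself.
type Planner interface {
    Plan(ctx context.Context, query string, priorResults []ToolResult) ([]ToolCall, error)
}

// Executor runs tool calls and returns their results.
type Executor interface {
    Execute(ctx context.Context, calls []ToolCall) ([]ToolResult, error)
}

// Synthesizer turns tool results into the final streamed answer.
type Synthesizer interface {
    Synthesize(ctx context.Context, query string, results []ToolResult) (<-chan string, error)
}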

Flow Orchestrator: The Control Plane

The Flow Orchestrator manages the full lifecycle of a request. Its responsibilities include:

  • Coordinating planner invocations
  • Executing tools concurrently
  • Handling retries, timeouts, and cancellations
  • Streaming partial responses

Instead of a linear pipeline, the orchestrator supports parallel execution using Go’s goroutines. This becomes essential once multiple independent tools are involved.
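
A minimal sketch of per‑tool retry and timeout handling, reusing the illustrative types above (it needs the context, fmt, and time packages); the attempt count and timeout values are assumptions, not the framework’s real defaults:

// executeWithRetry runs one tool call with a per-attempt timeout and a
// bounded number of retries, respecting cancellation of the parent request.
func executeWithRetry(ctx context.Context, call ToolCall, run func(context.Context, ToolCall) (ToolResult, error)) (ToolResult, error) {
    const maxAttempts = 3
    const perAttemptTimeout = 2 * time.Second

    var lastErr error
    for attempt := 1; attempt <= maxAttempts; attempt++ {
        attemptCtx, cancel := context.WithTimeout(ctx, perAttemptTimeout)
        res, err := run(attemptCtx, call)
        cancel()
        if err == nil {
            return res, nil
        }
        lastErr = err
        // Stop retrying if the overall request was cancelled or timed out.
        if ctx.Err() != nil {
            return ToolResult{}, ctx.Err()
        }
    }
    return ToolResult{}, fmt.Errorf("tool %s failed after %d attempts: %w", call.Name, maxAttempts, lastErr)
}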

Query Planner: Mandatory First Pass, Conditional Iteration

The Query Planner is always invoked at least once.

First Planner Call (Always)

On the first invocation, the planner:

  • Analyzes the user query
  • Produces an initial set of tool calls
  • Establishes a consistent reasoning baseline

Even trivial queries go through this step to maintain uniform behavior and observability.

Lightweight Classifier Gate

Before invoking the planner a second time, we run a lightweight classifier model to determine whether the query is:

  • Single‑step
  • Multi‑step

This classifier is intentionally cheap and fast.

Second Planner Call (Only for Multi‑Step Queries)

If the query is classified as multi‑step:

  1. The planner is invoked again.
  2. It receives:
    • The original user query
    • Tool responses from the first execution
  3. It determines:
    • Whether more tools are required
    • Which tools to call next
    • How to sequence them

This prevents uncontrolled planner loops — one of the most common failure modes in agentic systems.
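
Put together, the gated two‑pass flow looks roughly like this, again using the illustrative types from earlier and modeling the classifier as a plain function:

type QueryClass int

const (
    SingleStep QueryClass = iota
    MultiStep
)

// handleQuery runs the mandatory first planning pass, then consults the
// cheap classifier before allowing a second pass.
func handleQuery(ctx context.Context, planner Planner, exec Executor, classify func(string) QueryClass, query string) ([]ToolResult, error) {
    // First planner call: always runs, even for trivial queries.
    calls, err := planner.Plan(ctx, query, nil)
    if err != nil {
        return nil, err
    }
    results, err := exec.Execute(ctx, calls)
    if err != nil {
        return nil, err
    }

    // Second planner call only when the classifier says multi-step.
    if classify(query) == MultiStep {
        followUp, err := planner.Plan(ctx, query, results)
        if err != nil {
            return nil, err
        }
        if len(followUp) > 0 {
            more, err := exec.Execute(ctx, followUp)
            if err != nil {
                return nil, err
            }
            results = append(results, more...)
        }
    }
    return results, nil
}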

Tool Registry: Where Reasoning Meets Reality

Every tool implements a strict Go interface:

// ToolInterface is the interface developers implement for each tool. It uses
// generics to keep inputs and outputs strongly typed.
type ToolInterface[Input any, Output any] interface {
    // Execute initiates the execution of a tool.
    //
    // Parameters:
    // - ctx:        Context for cancellation/timeout.
    // - requestContext: Additional request‑specific data.
    // - input:      Strong‑typed tool request input.
    // Returns:
    // - output:     Strong‑typed tool request output.
    // - toolContext: Additional output data not used by the agent model.
    // - err:        Structured error from tool (e.g., no_response).
    Execute(ctx context.Context, requestContext *RequestContext, input Input) (output Output, toolContext ToolResponseContext, err error)

    // GetDefinition gets the tool definition sent to the Large Language Model.
    GetDefinition() ToolDefinition
}

This design gives us:

  • Natural‑language outputs for planner feedback
  • Structured metadata for downstream use
  • Compile‑time safety
  • Safe parallel execution

The Tool Registry acts as a trust boundary. Planner outputs are treated as intent, not direct instructions.
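
As an example, a hypothetical weather tool could implement the interface like this. WeatherInput, WeatherOutput, and the ToolDefinition fields shown here are assumptions made for illustration, not the framework’s real definitions:

// WeatherInput and WeatherOutput are hypothetical strongly typed payloads.
type WeatherInput struct {
    City string `json:"city"`
    Date string `json:"date"`
}

type WeatherOutput struct {
    Summary string `json:"summary"` // natural-language text fed back to the planner
}

// WeatherTool satisfies ToolInterface[WeatherInput, WeatherOutput].
type WeatherTool struct{}

func (t *WeatherTool) Execute(ctx context.Context, requestContext *RequestContext, input WeatherInput) (WeatherOutput, ToolResponseContext, error) {
    // A real implementation would call the upstream weather service here.
    // Structured metadata the agent model doesn't need goes in ToolResponseContext.
    return WeatherOutput{Summary: "Light rain expected in " + input.City}, ToolResponseContext{}, nil
}

func (t *WeatherTool) GetDefinition() ToolDefinition {
    return ToolDefinition{
        Name:        "get_weather",
        Description: "Returns a weather forecast for a city and date.",
    }
}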

Parallel Tool Execution

Planner‑generated tool calls are executed concurrently whenever possible. Go’s concurrency model makes this practical:

  • Lightweight goroutines
  • Context‑based cancellation
  • Efficient I/O‑bound execution

This is one of the reasons Go scales better than Python when agentic systems move beyond prototypes.
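
A sketch of the fan‑out using sync.WaitGroup and the illustrative types from earlier; a production version would also bound concurrency and attach tracing:

// executeAll fans tool calls out across goroutines and gathers results in order.
func executeAll(ctx context.Context, calls []ToolCall, run func(context.Context, ToolCall) (ToolResult, error)) []ToolResult {
    results := make([]ToolResult, len(calls))
    var wg sync.WaitGroup
    for i, call := range calls {
        wg.Add(1)
        go func(i int, call ToolCall) {
            defer wg.Done()
            res, err := run(ctx, call)
            if err != nil {
                res = ToolResult{Tool: call.Name, Err: err}
            }
            results[i] = res
        }(i, call)
    }
    wg.Wait()
    return results
}

Each goroutine writes only to its own slot in the results slice, so no extra locking is needed.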

Response Generation and Streaming

Once tools complete, responses flow into the Response Generator.

  • Knowledge‑based queries are summarized and synthesized using an LLM.
  • Direct‑answer queries (weather, sports, stocks) bypass synthesis and return raw tool output.

Responses are streamed via Server‑Sent Events (SSE) so users see partial results early, improving perceived latency.
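
A minimal SSE handler using only net/http looks roughly like this; the chunks channel is assumed to be fed by the response generator:

// streamSSE writes generator output to the client as Server-Sent Events.
func streamSSE(w http.ResponseWriter, r *http.Request, chunks <-chan string) {
    w.Header().Set("Content-Type", "text/event-stream")
    w.Header().Set("Cache-Control", "no-cache")
    w.Header().Set("Connection", "keep-alive")

    flusher, ok := w.(http.Flusher)
    if !ok {
        http.Error(w, "streaming unsupported", http.StatusInternalServerError)
        return
    }

    for {
        select {
        case <-r.Context().Done():
            return // client disconnected
        case chunk, open := <-chunks:
            if !open {
                return // generator finished
            }
            fmt.Fprintf(w, "data: %s\n\n", chunk)
            flusher.Flush() // push the partial result immediately
        }
    }
}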

Caching Strategy: Making Agentic Search Economical

One production reality became clear almost immediately: LLM calls have real cost — in both latency and dollars.

Once we began serving beta traffic, caching became mandatory. Our guiding principle was simple: Avoid LLM calls whenever possible.

Layer 1: Semantic Cache (Full Response)

We first check a semantic cache keyed on the user query.

  • Cache hit → return response immediately
  • Cache miss → continue to the next layer

The entire agentic flow is bypassed on a hit, delivering the biggest latency and cost win.
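
One way to implement the hit test is to embed the query and compare it against cached entries by cosine similarity. This is a sketch under assumptions: the embedding step is elided, the 0.92 threshold is illustrative, and it relies on the math package:

// cacheEntry pairs a cached query embedding with its full response.
type cacheEntry struct {
    Vector   []float64
    Response string
}

// semanticLookup returns a cached response if any entry is similar enough to the query embedding.
func semanticLookup(queryVec []float64, entries []cacheEntry) (string, bool) {
    const threshold = 0.92
    for _, e := range entries {
        if cosine(queryVec, e.Vector) >= threshold {
            return e.Response, true
        }
    }
    return "", false
}

// cosine computes cosine similarity between two equal-length vectors.
func cosine(a, b []float64) float64 {
    var dot, na, nb float64
    for i := range a {
        dot += a[i] * b[i]
        na += a[i] * a[i]
        nb += b[i] * b[i]
    }
    return dot / (math.Sqrt(na) * math.Sqrt(nb))
}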

Layer 2: Planner Response Cache

If the semantic cache misses, we check whether the planner output (tool plan) is cached.

  • Cache hit → skip the planner LLM call and execute tools directly
  • Cache miss → invoke the planner LLM

Planner calls are among the most expensive and variable operations — caching them stabilizes both latency and cost.

Layer 3: Summarizer Cache

Finally, we cache summarizer outputs.

  • Tool results often repeat
  • Final synthesis can be reused
  • Reduces LLM load during traffic spikes

Each cache layer short‑circuits a different part of the pipeline.
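
End to end, the layering looks roughly like this. The cache variables, hashResults, and summarize are hypothetical names standing in for the real components:

// answer shows how each cache layer short-circuits a different stage.
func answer(ctx context.Context, query string) (string, error) {
    // Layer 1: semantic cache on the full response.
    if resp, ok := semanticCache.Get(ctx, query); ok {
        return resp, nil
    }

    // Layer 2: planner response cache; on a hit we skip the planner LLM call.
    plan, ok := plannerCache.Get(ctx, query)
    if !ok {
        var err error
        plan, err = planner.Plan(ctx, query, nil)
        if err != nil {
            return "", err
        }
        plannerCache.Set(ctx, query, plan)
    }

    results, err := exec.Execute(ctx, plan)
    if err != nil {
        return "", err
    }

    // Layer 3: summarizer cache keyed on the tool results.
    key := hashResults(results)
    if summary, ok := summarizerCache.Get(ctx, key); ok {
        return summary, nil
    }
    summary, err := summarize(ctx, query, results)
    if err != nil {
        return "", err
    }
    summarizerCache.Set(ctx, key, summary)
    semanticCache.Set(ctx, query, summary)
    return summary, nil
}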

Lessons from Production

A few hard‑earned lessons:

  • LLM calls are expensive — caching isn’t optional at scale
  • Semantic caching pays off immediately
  • Planner loops must be gated
  • Most queries are simpler than they look
  • Tools fail — retries and fallbacks matter
  • Observability is non‑negotiable
  • Agents aren’t autonomous — orchestration beats autonomy