Architecting efficient context-aware multi-agent framework for production

Published: December 17, 2025 at 04:15 AM EST
11 min read

Source: Google Developers Blog

Overview

The landscape of AI agent development is shifting fast. We’ve moved beyond prototyping single‑turn chatbots. Today, organizations are deploying sophisticated, autonomous agents to handle long‑horizon tasks: automating workflows, conducting deep research, and maintaining complex codebases.

That ambition immediately runs into a bottleneck: context.

As agents run longer, the amount of information they need to track—chat history, tool outputs, external documents, intermediate reasoning—explodes. The prevailing “solution” has been to lean on ever‑larger context windows in foundation models. But simply giving agents more space to paste text cannot be the single scaling strategy.

To build production‑grade agents that are reliable, efficient, and debuggable, the industry is exploring a new discipline:

Context engineering

Treating context as a first‑class system with its own architecture, lifecycle, and constraints.

Based on our experience scaling complex single-agent and multi-agent systems, we designed and evolved the context stack in Google Agent Development Kit (ADK) to support that discipline. ADK is an open‑source, multi‑agent‑native framework built to make active context engineering achievable in real systems.

The scaling bottleneck

A large context window helps with some context-related problems, but it doesn't address all of them. In practice, the naive pattern of appending everything into one giant prompt collapses under three-way pressure:

  • Cost and latency spirals: Model cost and time‑to‑first‑token grow quickly with context size. “Shoveling” raw history and verbose tool payloads into the window makes agents prohibitively slow and expensive.
  • Signal degradation (“lost in the middle”): A context window flooded with irrelevant logs, stale tool outputs, or deprecated state can distract the model, causing it to fixate on past patterns rather than the immediate instruction. To ensure robust decision‑making, we must maximize the density of relevant information.
  • Physical limits: Real‑world workloads—involving full RAG results, intermediate artifacts, and long conversation traces—eventually overflow even the largest fixed windows.

Throwing more tokens at the problem buys time, but it doesn’t change the shape of the curve. To scale, we need to change how context is represented and managed, not just how much of it we can cram into a single call.

The design thesis: context as a compiled view

In the previous generation of agent frameworks, context was treated like a mutable string buffer. ADK is built around a different thesis:

Context is a compiled view over a richer stateful system.

In that view:

  • Sessions, memory, and artifacts (files) are the sources – the full, structured state of the interaction and its data.
  • Flows and processors are the compiler pipeline – a sequence of passes that transform that state.
  • The working context is the compiled view you ship to the LLM for a single invocation.

Once you adopt this mental model, context engineering stops being prompt gymnastics and starts looking like systems engineering. You are forced to ask standard systems questions: What is the intermediate representation? Where do we apply compaction? How do we make transformations observable?

ADK’s architecture answers these questions via three design principles:

  1. Separate storage from presentation – We distinguish between durable state (Sessions) and per‑call views (working context). This allows you to evolve storage schemas and prompt formats independently.
  2. Explicit transformations – Context is built through named, ordered processors, not ad‑hoc string concatenation. This makes the “compilation” step observable and testable.
  3. Scope by default – Every model call and sub‑agent sees the minimum context required. Agents must reach for more information explicitly via tools, rather than being flooded by default.

ADK’s tiered structure, its relevance mechanisms, and its multi‑agent handoff semantics are essentially an application of this “compiler” thesis and the three principles:

  • Structure – a tiered model that separates how information is stored from what the model sees.
  • Relevance – agentic and human controls that decide what matters now.
  • Multi‑agent context – explicit semantics for handing off the right slice of context between agents.

The next sections walk through each of these pillars in turn.

Structure: The tiered model

Most early agent systems implicitly assume a single window of context. ADK goes the other way. It separates storage from presentation and organizes context into distinct layers, each with a specific job:

Layer | Description
--- | ---
Working context | The immediate prompt for this model call: system instructions, agent identity, selected history, tool outputs, optional memory results, and references to artifacts.
Session | The durable log of the interaction: every user message, agent reply, tool call, tool result, control signal, and error, captured as structured Event objects.
Memory | Long-lived, searchable knowledge that outlives a single session: user preferences, past conversations, etc.
Artifacts | Large binary or textual data associated with the session or user (files, logs, images), addressed by name and version rather than pasted into the prompt.

1.1 Working context as a recomputed view

For each invocation, ADK rebuilds the Working Context from the underlying state. It starts with instructions and identity, pulls in selected Session events, and optionally attaches memory results. This view is:

  • Ephemeral – discarded after the call.
  • Configurable – you can change formatting without migrating storage.
  • Model‑agnostic – works with any LLM backend.

This flexibility is the first win of the compiler view: you stop hard‑coding “the prompt” and start treating it as a derived representation.
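
As a rough illustration of that idea, the sketch below rebuilds a working context from durable sources by running an ordered list of passes. The names (SourceState, Processor, compile_working_context) are invented for this example and are not ADK classes.

# Illustrative sketch only -- these names are not part of the ADK API.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class SourceState:
    """Durable sources: session events, memory hits, artifact references."""
    events: list[dict] = field(default_factory=list)
    memory_snippets: list[str] = field(default_factory=list)
    artifact_refs: list[str] = field(default_factory=list)

# A processor ("compiler pass") reads the sources and appends prompt parts.
Processor = Callable[[SourceState, list[str]], None]

def include_recent_events(state: SourceState, parts: list[str]) -> None:
    # Selection and formatting live here, not in the stored Session.
    for event in state.events[-5:]:
        parts.append(f"{event['role']}: {event['text']}")

def compile_working_context(state: SourceState, pipeline: list[Processor]) -> str:
    # Rebuild the ephemeral working context for a single model call.
    parts: list[str] = []
    for processor in pipeline:
        processor(state, parts)
    return "\n\n".join(parts)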

1.2 Flows and processors: context processing as a pipeline

Once you separate storage from presentation, you need machinery to compile one into the other. In ADK, every LLM‑based agent is backed by an LLM Flow, which maintains ordered lists of processors.

A (simplified) SingleFlow might look like:

# Example of a simplified SingleFlow definition
flow = SingleFlow(
    processors=[
        "static_instruction",   # immutable system prompt
        "contents",            # builds the working context from the Session
        "retrieval",           # optional external knowledge fetch
        "tool_calls",          # invoke tools if needed
        "response",            # generate the final LLM reply
    ]
)

These flows are ADK’s machinery to compile context. The order matters: each processor builds on the outputs of the previous steps. This gives you natural insertion points for custom filtering, compaction strategies, caching, and multi‑agent routing. You are no longer rewriting giant prompt templates; you’re just adding or reordering processors.
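
Continuing the illustrative sketch from above (again, not ADK's real processor interface), a custom filtering or compaction step is just another pass in the list:

def truncate_verbose_parts(state: SourceState, parts: list[str]) -> None:
    # Illustrative filtering pass: stub out oversized tool payloads so they
    # stop dominating the window.
    parts[:] = [p if len(p) < 2000 else p[:200] + " [truncated tool output]"
                for p in parts]

# Order matters: history is built first, then trimmed.
pipeline = [include_recent_events, truncate_verbose_parts]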

1.3 Session and events: structured, language‑agnostic history

An ADK Session represents the definitive state of a conversation or workflow instance. Concretely, it acts as a container for:

  • Session metadata (IDs, app names)
  • A state scratchpad for structured variables
  • Events – a chronological list of strongly‑typed records

Instead of storing raw prompt strings, ADK captures every interaction—user messages, agent replies, tool calls, results, control signals, and errors—as Event records. This structural choice yields three distinct advantages:

Advantage | Why it matters
--- | ---
Model agnosticism | Swap underlying models without rewriting history; storage format is decoupled from prompt format.
Rich operations | Downstream components (compaction, time-travel debugging, memory ingestion) can operate over a rich event stream rather than parsing opaque text.
Observability | Provides a natural surface for analytics, letting you inspect precise state transitions and actions.

The bridge between this session and the working context is the contents processor. It performs the heavy lifting of transforming the Session into the history portion of the working context by executing three critical steps:

  1. Selection – filters the event stream to drop irrelevant events, partial events, and framework noise that shouldn’t reach the model.
  2. Transformation – flattens the remaining events into Content objects with the correct roles (user/assistant/tool) and annotations for the specific model API being used.
  3. Injection – writes the formatted history into llm_request.contents, ensuring downstream processors—and the model itself—receive a clean, coherent conversational trace.

In this architecture, the Session is your ground truth; the working context is merely a computed projection that you can refine and optimize over time.
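
A simplified sketch of those three steps follows; the Event and Content classes below are stand-ins, not ADK's actual schemas.

from dataclasses import dataclass

@dataclass
class Event:
    author: str            # "user", "agent", "tool", or framework-internal
    text: str
    partial: bool = False  # streaming chunks that should not reach the model

@dataclass
class Content:
    role: str              # "user" / "assistant" / "tool"
    text: str

def build_contents(events: list[Event]) -> list[Content]:
    # 1. Selection: drop partial events and framework noise.
    selected = [e for e in events
                if not e.partial and e.author in ("user", "agent", "tool")]
    # 2. Transformation: map event authors onto model-API roles.
    role_map = {"user": "user", "agent": "assistant", "tool": "tool"}
    return [Content(role=role_map[e.author], text=e.text) for e in selected]

# 3. Injection: in ADK the result is written into the outgoing request,
#    conceptually llm_request.contents = build_contents(session_events).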

1.4 Context compaction and filtering at the session layer

If you keep appending raw events indefinitely, latency and token usage will inevitably spiral out of control. ADK’s Context Compaction feature attacks this problem at the Session layer.

When a configurable threshold (e.g., number of invocations) is reached, ADK triggers an asynchronous process that:

  1. Uses an LLM to summarize older events over a sliding window (defined by a compaction interval and overlap size).
  2. Writes the resulting summary back into the Session as a new event with a "compaction" action.
  3. Allows the system to prune or de‑prioritize the raw events that were summarized.

Because compaction operates on the Event stream itself, the benefits cascade downstream:

  • Scalability – Sessions remain physically manageable even for extremely long‑running conversations.
  • Clean views – The contents processor automatically works over a history that is already compacted, requiring no complex logic at query time.
  • Decoupling – You can tune compaction prompts and strategies without touching a single line of agent code or template logic.

For strictly rule‑based reduction, ADK offers a sibling operation—Filtering—where pre‑built plugins can globally drop or trim context based on deterministic rules before it ever reaches the model.
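
A sliding-window compaction pass over that event stream might look like the sketch below; the function and parameter names are illustrative rather than ADK's compaction API, and it reuses the Event stand-in from the earlier sketch.

def compact_events(events: list[Event], summarize, interval: int = 20,
                   overlap: int = 5) -> list[Event]:
    # Summarize the oldest window of events and keep a small overlap of raw
    # events for continuity; newer events pass through untouched.
    if len(events) <= interval:
        return events
    window = events[:interval]
    summary_text = summarize([e.text for e in window])  # an LLM call in practice
    summary = Event(author="agent", text=f"[compaction] {summary_text}")
    return [summary] + events[interval - overlap:]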

1.5 Context caching

Modern models support context caching (prefix caching), which allows the inference engine to reuse attention computation across calls. ADK’s separation of Session (storage) and Working Context (view) provides a natural substrate for this optimization.

The architecture effectively divides the context window into two zones:

Zone | Typical contents
--- | ---
Stable prefixes | System instructions, agent identity, long-lived summaries
Variable suffixes | Latest user turn, new tool outputs, small incremental updates

Because ADK flows and processors are explicit, you can treat cache‑friendliness as a hard design constraint. By ordering your pipeline to keep frequently reused segments stable at the front of the context window—and pushing highly dynamic content toward the end—you maximize cache hit rates.

To enforce this rigor, ADK introduces static_instruction, a primitive that guarantees immutability for system prompts, ensuring that the cache prefix remains valid across invocations.
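
Concretely, cache-friendly ordering amounts to keeping the reusable segments byte-stable at the front and appending only the volatile parts, as in this illustrative helper (not an ADK API):

def assemble_prompt(static_instruction: str, compacted_summary: str,
                    latest_turns: list[str]) -> list[str]:
    # Stable prefix first (cache hit), variable suffix last (recomputed).
    stable_prefix = [static_instruction, compacted_summary]
    variable_suffix = latest_turns
    return stable_prefix + variable_suffix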

Relevance: Agentic management of what matters now

Once the structure is established, the core challenge shifts to relevance: Given a tiered context architecture, what specific information belongs in the model’s active window right now?

ADK answers this through a collaboration between human domain knowledge and agentic decision‑making.

  • Hard‑coded rules are cost‑effective but rigid.
  • Pure agent‑driven browsing is flexible but prohibitively expensive and unstable.

An optimal Working Context is a negotiation between the two. Human engineers define the architecture: where data lives, how it is summarized, and what filters apply. The agent then provides the intelligence, deciding dynamically when to reach for more, such as loading an artifact or searching memory for the current step.

2.1 Artifacts: externalizing large state

Early agent implementations often fall into the context dumping trap: placing large payloads—a 5 MB CSV, a massive JSON API response, or a full PDF transcript—directly into the chat history. This creates a permanent tax on the session; every subsequent turn drags that payload along, burying critical instructions and inflating costs.

ADK solves this by treating large data as Artifacts: named, versioned binary or text objects managed by an ArtifactService.

Conceptually, ADK applies a handle pattern to large data. Large data lives in the artifact store, not the prompt. By default, agents see only a lightweight reference (a name and summary) via the request processor. When—and only when—an agent requires the raw data to answer a question, it uses the LoadArtifactsTool. This action temporarily loads the content into the Working Context.

Crucially, ADK supports ephemeral expansion. Once the model call or task is complete, the artifact is offloaded from the working context by default. This turns “5 MB of noise in every prompt” into a precise, on‑demand resource. The data can be huge, but the context window remains lean.
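
A toy version of the handle pattern makes the trade-off visible; the names here are invented and do not match ArtifactService's or LoadArtifactsTool's real signatures.

class ArtifactStore:
    # Toy handle-pattern store: prompts carry references, not payloads.
    def __init__(self):
        self._data: dict[tuple[str, int], bytes] = {}
        self._latest: dict[str, int] = {}

    def save(self, name: str, payload: bytes) -> str:
        version = self._latest.get(name, 0) + 1
        self._data[(name, version)] = payload
        self._latest[name] = version
        return f"{name}@v{version}"  # lightweight reference for the prompt

    def load(self, name: str, version: int | None = None) -> bytes:
        version = version or self._latest[name]
        return self._data[(name, version)]

store = ArtifactStore()
ref = store.save("sales_report.csv", b"...5 MB of rows...")
# The working context carries only `ref`; an agent loads the bytes on demand.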

2.2 Memory: long‑term knowledge, retrieved on demand

Where Artifacts handle discrete, large objects, ADK’s Memory layer manages long‑lived, semantic knowledge that extends beyond a single session—user preferences, past decisions, and domain facts.

The MemoryService is built around two principles:

  1. Searchability – memory must be searchable (not permanently pinned).
  2. Agent‑directed retrieval – agents decide when to fetch it.

The service ingests data—often from finished Sessions—into a vector or keyword corpus. Agents then access this knowledge via two patterns:

  • Reactive recall – the agent recognizes a knowledge gap (“What is the user’s dietary restriction?”) and explicitly calls the load_memory_tool to search the corpus.
  • Proactive recall – a pre‑processor runs a similarity search based on the latest user input, injecting likely relevant snippets via the preload_memory_tool before the model is even invoked.

This replaces the “context stuffing” anti‑pattern with a “memory‑based” workflow. Agents recall exactly the snippets they need for the current step, rather than carrying the weight of every conversation they have ever had.
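
A toy corpus shows the two recall patterns side by side; the class and scoring below are illustrative, not the MemoryService interface.

import re

class MemoryCorpus:
    # Toy searchable memory; real deployments use a vector or keyword index.
    def __init__(self):
        self._entries: list[str] = []

    def ingest(self, text: str) -> None:
        self._entries.append(text)

    def _tokens(self, text: str) -> set[str]:
        return set(re.findall(r"[a-z']+", text.lower()))

    def search(self, query: str, k: int = 3) -> list[str]:
        terms = self._tokens(query)
        scored = [(len(terms & self._tokens(e)), e) for e in self._entries]
        return [e for score, e in sorted(scored, reverse=True)[:k] if score > 0]

memory = MemoryCorpus()
memory.ingest("The user's dietary restriction is vegetarian.")

# Reactive recall: the agent issues an explicit search when it spots a gap.
hits = memory.search("What is the user's dietary restriction?")

# Proactive recall: a pre-processor searches on the latest user turn and
# injects likely-relevant snippets before the model is invoked.
preloaded = memory.search("Book a vegetarian dinner for Friday.")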

Multi‑agent context: who sees what, when

Single‑agent systems struggle with context bloat; multi‑agent systems amplify it. If a root agent passes its full history to a sub‑agent, and that sub‑agent does the same, you trigger a context explosion. Token counts skyrocket, and sub‑agents become confused by irrelevant conversational history.

Whenever an agent invokes another agent, ADK lets you explicitly scope what the callee sees—maybe just the latest user query and one artifact—while suppressing most of the ancestral history.

3.1 Two multi‑agent interaction patterns

  1. Agents as Tools – the root agent treats a specialized agent strictly as a function: call it with a focused prompt, get a result, and move on. The callee sees only the specific instructions and necessary artifacts—no history.
  2. Agent Transfer (Hierarchy) – control is fully handed off to a sub‑agent to continue the conversation. The sub‑agent inherits a view over the Session and can drive the workflow, calling its own tools or transferring control further down the chain.

3.2 Scoped handoffs for agent transfer

Handoff behavior is controlled by knobs like include_contents on the callee, which determine how much context flows from the root agent to a sub‑agent.

  • Default mode – ADK passes the full contents of the caller’s working context (useful when the sub‑agent genuinely benefits from the entire history).
  • None mode – the sub‑agent sees no prior history; it only receives the new prompt you construct for it (e.g., the latest user turn plus a couple of tool calls and responses).

Specialized agents get the minimal context they need, rather than inheriting a giant transcript by default.

Because a sub‑agent’s context is also built via processors, these handoff rules plug into the same flow pipeline as any other context‑building step.
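
Putting both patterns together might look like the sketch below. The import paths and parameter names (LlmAgent, AgentTool, include_contents, and the model ID) are assumptions based on the Python ADK and may differ in your version.

# Hedged sketch: import paths and parameter names are assumed from the
# google-adk Python package and may differ across versions.
from google.adk.agents import LlmAgent
from google.adk.tools.agent_tool import AgentTool

# Pattern 1: agent as a tool -- the callee sees only the focused request.
summarizer = LlmAgent(
    name="summarizer",
    model="gemini-2.0-flash",
    instruction="Summarize the text you are given in three bullet points.",
)

# Pattern 2: agent transfer -- the sub-agent continues the conversation, but
# include_contents='none' keeps the ancestral history out of its view.
billing_agent = LlmAgent(
    name="billing_agent",
    model="gemini-2.0-flash",
    instruction="Handle billing questions using the latest user request only.",
    include_contents="none",
)

root_agent = LlmAgent(
    name="root_agent",
    model="gemini-2.0-flash",
    instruction="Route billing questions to billing_agent; use the summarizer "
                "tool for long documents.",
    tools=[AgentTool(agent=summarizer)],
    sub_agents=[billing_agent],
)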
