How Code-Executing AI Agents Are Making 128K Context Windows Obsolete

Published: January 10, 2026 at 11:02 AM EST
3 min read
Source: Dev.to

The Problem: Context Rot

Long‑context windows are expensive, slow, and often wasted.
When an agent analyzes a 50,000‑word document, it typically loads the entire text into its context, processes it once, and then struggles to recall a specific sentence from the middle. This phenomenon, known as context rot, occurs because attention scores dilute across many tokens, causing the model to lose track of what it just read.

Buying a larger context window is like buying a larger suitcase because you can’t decide what to pack; it doesn’t solve the underlying organizational problem.

The RLM Inversion: Don’t Process, Orchestrate

The Recursive Language Model (RLM) flips the script. Instead of ingesting data, it interacts with data.

“The LLM’s context is not a storage tank. It’s a workbench.”

An RLM is given a persistent Python REPL. The data—whether a 10,000‑page PDF or a massive database—is not loaded into the model’s context. It exists as a variable, input_data, accessible only through code. This forces a fundamental shift in behavior:

1. Search, Don’t Read

The RLM cannot “see” the data directly. It must write Python code to search for keywords, filter entities, or slice specific sections, retrieving only what it needs.
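
For example, a single reasoning turn might emit a small search helper like the one below. This is a minimal sketch: input_data is the variable the harness exposes, while the helper name, regex pattern, and snippet window are illustrative assumptions, not a fixed RLM API.

import re

# Minimal "search, don't read" helper. input_data is the full document
# string exposed by the REPL harness; everything else is illustrative.
def find_snippets(pattern, text, window=200):
    """Return short text windows around each regex match."""
    snippets = []
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        snippets.append(text[start:end])
    return snippets

# Retrieve only the passages that matter, never the whole document.
hits = find_snippets(r"quarterly revenue", input_data)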

2. Store in RAM, Not in Neurons

Intermediate findings are stored in Python variables rather than in the model’s context history. This acts as an “extended memory” that doesn’t suffer from attention decay.
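
Concretely, findings can be parked in an ordinary dictionary that survives across turns. Another sketch, reusing the hypothetical find_snippets helper from the previous example:

# Intermediate results live in the REPL, not in the prompt history.
findings = {}

findings["revenue"] = find_snippets(r"quarterly revenue", input_data)
findings["risks"] = find_snippets(r"risk factor", input_data)

# A later turn can reload these verbatim; nothing decays with distance.
print(len(findings["revenue"]), "revenue passages cached")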

3. Delegate, Don’t Deliberate

For large datasets, the RLM can spawn sub‑LLMs—fresh model instances with clean contexts. It can batch‑process 100 document chunks in parallel via llm_batch(). The main RLM only sees the summaries, keeping its own context crystal clear.
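
A delegation step could look like the sketch below. llm_batch() is named in the source, but its exact signature is assumed here; the chunk size and prompt wording are illustrative.

# Split the document into chunks and fan them out to sub-LLMs.
chunk_size = 4000
chunks = [input_data[i:i + chunk_size]
          for i in range(0, len(input_data), chunk_size)]

prompts = ["Summarize the key claims in this passage:\n\n" + chunk
           for chunk in chunks]

# Each sub-LLM starts from a clean context; only summaries come back.
summaries = llm_batch(prompts)

# The orchestrating RLM now reasons over short summaries, not raw text.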

The “Diffusion” Answer: Multi‑Turn Reasoning

Traditional chat models produce a one‑shot response; once a sentence is written, it’s locked in. An RLM operates differently. It initializes an answer state and diffuses its answer over multiple reasoning turns, drafting, fact‑checking, revising, and only setting ready=True when the artifact is refined.

# Diffusion answer skeleton: the answer is mutable state that is
# refined over multiple reasoning turns rather than written once.
# llm_generate and verify are placeholders for the host framework's
# generation and fact-checking calls.
answer = {"content": "", "ready": False}

MAX_TURNS = 10  # guard against an unbounded revision loop

for _ in range(MAX_TURNS):
    # draft or revise the next fragment of the answer
    fragment = llm_generate(answer["content"])
    answer["content"] += fragment

    # fact-check the running draft; finish only when it passes
    if verify(answer["content"]):
        answer["ready"] = True
        break

Traditional Context vs. RLM

| Aspect | Traditional Long‑Context | Recursive Language Model (RLM) |
| --- | --- | --- |
| Data Handling | Load everything into context | Access programmatically via code |
| Memory | Attention‑based (decays) | Python variables (persistent) |
| Scaling | Larger context window | Parallel sub‑LLM delegation |
| Transparency | Black box | Fully auditable code trace |

Get Involved

The RLM paradigm is more than a theory; it’s an architecture you can explore today. A reference implementation built with PydanticAI and FastAPI is open‑sourced.

Repository:

The future doesn’t belong to the model with the longest memory; it belongs to the one that knows it doesn’t need to remember everything.
