How Code-Executing AI Agents Are Making 128K Context Windows Obsolete

Published: January 10, 2026 at 11:02 AM EST
3 min read
Source: Dev.to

The Problem: Context Rot

Long‑context windows are expensive, slow, and often wasted.
When an agent analyzes a 50,000‑word document, it typically loads the entire text into its context, processes it once, and then struggles to recall a specific sentence from the middle. This phenomenon, known as context rot, occurs because attention scores dilute across many tokens, causing the model to lose track of what it just read.

Buying a larger context window is like buying a larger suitcase because you can’t decide what to pack; it doesn’t solve the underlying organizational problem.

The RLM Inversion: Don’t Process, Orchestrate

The Recursive Language Model (RLM) flips the script. Instead of ingesting data, it interacts with data.

“The LLM’s context is not a storage tank. It’s a workbench.”

An RLM is given a persistent Python REPL. The data—whether a 10,000‑page PDF or a massive database—is not loaded into the model’s context. It exists as a variable, input_data, accessible only through code. This forces a fundamental shift in behavior:

1. Search, Don’t Read

The RLM cannot “see” the data directly. It must write Python code to search for keywords, filter entities, or slice specific sections, retrieving only what it needs.
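
For example, a single reasoning turn might emit a small search helper like the one below. This is a minimal sketch: input_data is the variable the harness exposes, while the helper name, regex pattern, and snippet window are illustrative assumptions, not a fixed RLM API.

import re

# Minimal "search, don't read" helper. input_data is the full document
# string exposed by the REPL harness; everything else is illustrative.
def find_snippets(pattern, text, window=200):
    """Return short text windows around each regex match."""
    snippets = []
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        snippets.append(text[start:end])
    return snippets

# Retrieve only the passages that matter, never the whole document.
hits = find_snippets(r"quarterly revenue", input_data)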

2. Store in RAM, Not in Neurons

Intermediate findings are stored in Python variables rather than in the model’s context history. This acts as an “extended memory” that doesn’t suffer from attention decay.
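
Concretely, findings can be parked in an ordinary dictionary that survives across turns. Another sketch, reusing the hypothetical find_snippets helper from the previous example:

# Intermediate results live in the REPL, not in the prompt history.
findings = {}

findings["revenue"] = find_snippets(r"quarterly revenue", input_data)
findings["risks"] = find_snippets(r"risk factor", input_data)

# A later turn can reload these verbatim; nothing decays with distance.
print(len(findings["revenue"]), "revenue passages cached")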

3. Delegate, Don’t Deliberate

For large datasets, the RLM can spawn sub‑LLMs—fresh model instances with clean contexts. It can batch‑process 100 document chunks in parallel via llm_batch(). The main RLM only sees the summaries, keeping its own context crystal clear.
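
A delegation step could look like the sketch below. llm_batch() is named in the source, but its exact signature is assumed here; the chunk size and prompt wording are illustrative.

# Split the document into chunks and fan them out to sub-LLMs.
chunk_size = 4000
chunks = [input_data[i:i + chunk_size]
          for i in range(0, len(input_data), chunk_size)]

prompts = ["Summarize the key claims in this passage:\n\n" + chunk
           for chunk in chunks]

# Each sub-LLM starts from a clean context; only summaries come back.
summaries = llm_batch(prompts)

# The orchestrating RLM now reasons over short summaries, not raw text.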

The “Diffusion” Answer: Multi‑Turn Reasoning

Traditional chat models produce a one‑shot response; once a sentence is written, it’s locked in. An RLM operates differently. It initializes an answer state and diffuses its answer over multiple reasoning turns, drafting, fact‑checking, revising, and only setting ready=True when the artifact is refined.

# Diffusion answer skeleton: the answer is mutable state that is
# refined over multiple reasoning turns rather than written once.
# llm_generate and verify are placeholders for the host framework's
# generation and fact-checking calls.
answer = {"content": "", "ready": False}

MAX_TURNS = 10  # guard against an unbounded revision loop

for _ in range(MAX_TURNS):
    # draft or revise the next fragment of the answer
    fragment = llm_generate(answer["content"])
    answer["content"] += fragment

    # fact-check the running draft; finish only when it passes
    if verify(answer["content"]):
        answer["ready"] = True
        break

Traditional Context vs. RLM

| Aspect | Traditional Long‑Context | Recursive Language Model (RLM) |
| --- | --- | --- |
| Data Handling | Load everything into context | Access programmatically via code |
| Memory | Attention‑based (decays) | Python variables (persistent) |
| Scaling | Larger context window | Parallel sub‑LLM delegation |
| Transparency | Black box | Fully auditable code trace |

Get Involved

The RLM paradigm is more than a theory; it’s an architecture you can explore today. A reference implementation built with PydanticAI and FastAPI is open‑sourced.

Repository:

The future doesn’t belong to the model with the longest memory; it belongs to the one that knows it doesn’t need to remember everything.
