LLM386: borrowing a 1990s idea for managing LLM context
Source: Dev.to
In 1989, DOS had a 640 KB ceiling on conventional memory. EMM386 used the 80386 CPU’s address‑translation hardware to page chunks of a much larger memory space through a small fixed window inside that 640 KB. Programs that asked nicely got effectively unlimited memory through a peephole, by paging only what was relevant for the current operation.
LLMs have the same problem.
The context window is bounded—32 K, 128 K, 1 M tokens. Your data (conversation history, retrieved documents, tool results, persistent facts) will exceed any window you’re willing to pay for. Every call must choose what gets through.
The common approach is ad‑hoc: keep messages in a list, retrieve “the last N plus a vector hit,” concatenate, send. This breaks down once the prompt grows enough that you can’t trace what’s been included. The model gives an answer; nobody can explain why; two turns produce different responses for reasons that aren’t recorded anywhere.
LLM386 is the runtime EMM386 was, applied to LLM context windows.
The thesis
f(context) → output
The model is a pure function: no memory, no persistence, no cross‑call state. All continuity has to be reconstructed on each call.
- Durable state lives in a store owned by the runtime. The model is a stateless consumer.
- The prompt for each call is recomputed from that store, with the model’s input budget as the constraint.
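The two bullets above can be sketched in a few lines (hypothetical names throughout, not the LLM386 API): state is committed to a store, and every call rebuilds the prompt from that store under a budget.

```python
# Sketch of the thesis: the model is a pure function; all continuity is
# reconstructed from a durable store on each call. Names are assumptions.
class Store:
    def __init__(self):
        self.blocks = []

    def commit(self, block):
        self.blocks.append(block)

def build_prompt(store, budget_chars):
    # Recompute context from durable state, newest-first, under a budget.
    out, used = [], 0
    for block in reversed(store.blocks):
        if used + len(block) > budget_chars:
            break
        out.append(block)
        used += len(block)
    return "\n".join(reversed(out))

def call_model(prompt):
    return f"echo({len(prompt)} chars)"    # stand-in for a stateless model

store = Store()
store.commit("user: hello")
store.commit("assistant: hi")
prompt = build_prompt(store, budget_chars=100)
output = call_model(prompt)                # f(context) -> output
store.commit(f"assistant: {output}")       # commit output back as state
```

The model object never holds state; deleting and recreating it between calls changes nothing, because the store is the only source of continuity.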
What’s in the runtime
- Persistent block store – LMDB, content‑addressed, deduped on hash.
- Pager – selects which blocks fit the model’s input budget by running configured retrievers in parallel (recency, BM25, embedding ANN, custom), normalizing scores, merging by max‑per‑block, and allocating across canonical sections: System, Task, State, Plan, Retrieved, Tools, Recent, Background.
- Packer – renders the selection into a deterministic prompt string or a role‑tagged chat‑message list.
- Tracer – records what the model saw and why, with byte‑level prompt hashes for replay.
- Reducer – turns model output back into committed state via parsed events.
- Typed‑edge graph – ties dependent blocks together so the pager keeps tool results paired with the assistant message that called them.
- Diff layer – compares two trace records turn‑over‑turn.
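The pager's normalize-then-merge step, as described above, might look like the following (a sketch with assumed score shapes, not the actual Rust implementation): min-max normalize each retriever's scores, then keep the max score per block.

```python
# Sketch of max-per-block score merging across retrievers (assumed design).
def normalize(scores):
    # Min-max normalize one retriever's scores into [0, 1].
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {b: (s - lo) / span for b, s in scores.items()}

def merge_max(*retriever_scores):
    # Merge normalized scores, keeping the best score each block achieved.
    merged = {}
    for scores in map(normalize, retriever_scores):
        for block, s in scores.items():
            merged[block] = max(merged.get(block, 0.0), s)
    return sorted(merged, key=merged.get, reverse=True)

recency = {"b1": 3.0, "b2": 2.0, "b3": 1.0}   # raw recency scores
bm25    = {"b1": 9.0, "b3": 4.0, "b2": 1.0}   # raw BM25 scores
ranking = merge_max(recency, bm25)
```

Max-merge (rather than sum or average) means a block only needs to win with one retriever to rank highly, which keeps a strong BM25 hit from being diluted by a poor recency score.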
Implemented as a Rust library, Python SDK (PyO3 native extension), and CLI. Licensed Apache‑2.0. Current version: 1.0.0‑alpha.
What’s deliberately not in there
- No chatbot UI.
- No hidden state inside prompts.
- No treating model output as truth.
- No distributed storage in the initial version.
- No learned components in the hot path – every retriever, packer, and reducer is deterministic, which makes traces replayable. A learned reranker or trained embedding tweaker would break that property, so they are intentionally omitted.
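Determinism is what makes byte-level replay work: repacking the same blocks must reproduce the same prompt, hence the same hash. A sketch of the idea (assumed trace shape, not the real trace format):

```python
import hashlib

# Sketch: a deterministic packer plus a byte-level prompt hash. If any
# component in the hot path were learned or nondeterministic, the replayed
# hash could differ from the recorded one and the trace would be useless.
def pack(blocks):
    # Deterministic packing: fixed order, fixed separator, fixed encoding.
    return "\n---\n".join(blocks).encode("utf-8")

def trace(blocks):
    prompt = pack(blocks)
    return {"sha256": hashlib.sha256(prompt).hexdigest(),
            "n_blocks": len(blocks)}

blocks = ["system: be terse", "user: what is 2+2?"]
first = trace(blocks)
replay = trace(blocks)
assert first["sha256"] == replay["sha256"]   # replay reproduces the prompt
```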
Try it
git clone https://github.com/fitzee/llm386
cd llm386
export ANTHROPIC_API_KEY=sk-ant-...
docker compose -f examples/langgraph-agent/docker-compose.yml run --rm agent
Five minutes from clone to chatting. A small chatbot with two stub tools (a calculator and a fake user‑profile lookup) runs with LLM386 as the memory layer. Conversation persists across container restarts because the store is a Docker volume. The model recalls things from prior turns; that recall is provided entirely by the runtime, since LangGraph holds no state between turns.
Should you use it?
- Yes – if you have an agent that works in development but the prompts are a mess and you can’t reason about what the model is seeing.
- Probably not – if you just need a quick chatbot demo; use the simplest thing that runs.
- Yes – if you want to swap models without rewriting prompt assembly. The ModelProfile abstraction carries the context window, tokenizer, and capability flags; the pager and packer respect that contract regardless of which model you swap in.
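What such a profile might carry, per the description above (field names here are assumptions, not the actual ModelProfile definition):

```python
from dataclasses import dataclass, field

# Sketch of a model profile: the pager packs against whatever window the
# profile declares, so swapping models means swapping profiles.
@dataclass(frozen=True)
class ModelProfile:
    name: str
    context_window: int              # input budget in tokens
    tokenizer: str                   # tokenizer identifier
    capabilities: frozenset = field(default_factory=frozenset)

claude = ModelProfile("claude", 200_000, "anthropic", frozenset({"tools"}))
small  = ModelProfile("local-8b", 8_192, "tiktoken", frozenset())

def budget_for(profile, reserve_for_output=1_024):
    # Same packing code for both models; only the budget changes.
    return profile.context_window - reserve_for_output
```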
As agents get more complex, “what’s actually in the prompt right now?” becomes a hard question for many stacks. The runtime is designed to keep this cheap and transparent.
EMM386 worked because a bounded window into a larger memory was the right abstraction for a structurally constrained system. The same abstraction applies to LLM context windows three decades later.
GitHub: https://github.com/fitzee/llm386