When an AI Keeps Forgetting: Why LLM Workflows Collapse and What to Build Instead
Source: Dev.to
The Problem
I was six months into building a career‑intelligence project across ChatGPT and Claude when I noticed the rot. Terms I’d defined precisely were drifting — “Career Intelligence Framework” became “Career Intel System” in one session, “CI Framework” in another. Decisions I’d made weeks earlier resurfaced as open questions. I’d explain a concept, get good work from it, then three sessions later have to explain it again because the model had no memory of the conversation where we’d settled it. References were getting vague — “update the ontology” could mean three different files depending on which session you were in.
I thought something was mis‑configured. It wasn’t. ChatGPT loses track of terminology due to token windows, memory constraints, and the architecture of LLMs. The drift isn’t a bug you can report; it’s the default outcome of how these systems work — they regenerate language from patterns rather than retrieving it from stable storage. Every response is a fresh reconstruction, not a recall. Without scaffolding to hold the context steady, long‑term projects erode.
That erosion had a shape. I started cataloguing it.
Why It Breaks
The failure modes weren’t random. They fell into seven categories that kept recurring across every project thread, every model, every session length. I named them C1–C7 — a collapse‑risk taxonomy — because naming them precisely was the first step toward designing countermeasures.
| Risk | Description | How to Spot It |
|---|---|---|
| C1: Context saturation | Too much material floods the token window; the model stops tracking earlier context. | The model asks about something you covered twenty messages ago, or responses become generic because it can’t focus. |
| C2: Instruction dilution | Overlapping or conflicting instructions pile up. The model tries to satisfy all of them and satisfies none well. | Output quality degrades despite good context; the instructions are competing. |
| C3: Vocabulary drift | The model regenerates terminology from patterns instead of retrieving canonical terms. | “Career Intelligence Framework” → “Career Intelligence System” → “your CI project.” You’re unsure which name is official. |
| C4: Reference ambiguity | References (“that config file”, “the framework we discussed”, “update the ontology”) become vague across sessions. | You have to clarify which file, concept, or version you mean. |
| C5: Goal creep | Project scope shifts silently as the conversation expands. | You start designing a context‑pack template; three exchanges later you’re discussing full automation pipelines. No checkpoint catches the drift. |
| C6: Evidence entropy | Provenance erodes; the rationale for a decision lives in a conversation that’s now compressed or closed. | You can’t recall why a design choice was made because there’s no frozen snapshot. |
| C7: Thread fragmentation | Conversations split across sessions, models, and tools. No single view stitches them together. | Vocabulary lives in one thread, architecture decisions in another, implementation in a third. |
These risks are interdependent. Vocabulary drift (C3) creates reference ambiguity (C4). Context saturation (C1) accelerates instruction dilution (C2). Thread fragmentation (C7) amplifies all of them because no single session has the full picture.
What I Tried
My first instinct was comprehensive: design a full scaffolding system — the Harness — that mapped every collapse risk to a specific countermeasure.
1. Ontology.yml (anti‑C3)
Purpose: Vocabulary stability.
Contents: One canonical term per concept, approved synonyms, unique IDs.
How it helps: When the model drifts from “Career Intelligence Framework” to “Career Intel System,” the ontology is the authority. In theory a linter could catch drift in pull requests before it enters the codebase.
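A minimal sketch of what such a linter could look like, with the ontology inlined as a Python dict rather than loaded from Ontology.yml (a real version might parse the file with PyYAML). The concept ID `CIF-001` and the `drift_variants` field are illustrative assumptions, not part of the article's spec:

```python
import re

# Inlined stand-in for Ontology.yml: one canonical term per concept,
# approved synonyms, and the drifted variants a linter should flag.
ONTOLOGY = {
    "CIF-001": {
        "canonical": "Career Intelligence Framework",
        "approved_synonyms": ["CI Framework"],
        "drift_variants": ["Career Intel System", "Career Intelligence System"],
    },
}

def lint_vocabulary(text: str) -> list[str]:
    """Return one warning per drifted term found in `text`."""
    warnings = []
    for concept_id, entry in ONTOLOGY.items():
        for variant in entry["drift_variants"]:
            if re.search(re.escape(variant), text):
                warnings.append(
                    f"{concept_id}: found '{variant}', "
                    f"canonical term is '{entry['canonical']}'"
                )
    return warnings

print(lint_vocabulary("Let's update the Career Intel System docs."))
```

Run over changed files in a pre-commit hook or CI job, this is enough to stop drifted terms at the pull-request boundary.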
2. Context Packs (anti‑C1 & anti‑C2)
Purpose: Session re‑injection.
Structure:
purpose: "Update the career‑intel data model"
glossary_slice: ["Career Intelligence Framework", "CI Framework"]
current_milestone: "v0.3 – data ingestion"
active_constraints:
- token_limit: 8192
- no_new_dependencies
open_questions:
- "How to version the ontology?"
How it helps: Instead of dumping the full project into the context window, a compact bundle gives the model exactly what it needs for this session and nothing else.
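The re-injection step can be sketched as a function that flattens a pack into a session preamble. Only the pack fields come from the structure above; the rendering format and the `render_context_pack` name are assumptions:

```python
def render_context_pack(pack: dict) -> str:
    """Flatten a context pack into a prompt preamble pasted at session start."""
    lines = [
        f"Purpose: {pack['purpose']}",
        "Canonical terms: " + ", ".join(pack["glossary_slice"]),
        f"Current milestone: {pack['current_milestone']}",
        "Constraints: " + "; ".join(
            # Constraints may be key/value pairs or bare flags.
            ", ".join(f"{k}={v}" for k, v in c.items()) if isinstance(c, dict) else str(c)
            for c in pack["active_constraints"]
        ),
        "Open questions:",
    ]
    lines += [f"  - {q}" for q in pack["open_questions"]]
    return "\n".join(lines)

pack = {
    "purpose": "Update the career-intel data model",
    "glossary_slice": ["Career Intelligence Framework", "CI Framework"],
    "current_milestone": "v0.3 - data ingestion",
    "active_constraints": [{"token_limit": 8192}, "no_new_dependencies"],
    "open_questions": ["How to version the ontology?"],
}
print(render_context_pack(pack))
```

The point of the function is discipline, not cleverness: every session opens with the same compact, canonical slice of project state.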
3. Decision Log (anti‑C4 & anti‑C5)
Purpose: Rationale tracking.
**Decision**: Adopt GraphQL for the API layer
**Why**: Enables flexible client queries and reduces over‑fetching.
**Alternatives**: REST (rejected – higher latency), gRPC (rejected – steeper learning curve)
**Reversible**: Yes – abstraction layer isolates GraphQL specifics.
**Impacted artifacts**: api/schema.graphql, docs/api.md
How it helps: References stay unambiguous and scope changes become explicit rather than silent.
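One payoff of a structured log is that provenance becomes queryable. A sketch with entries held as Python records instead of markdown (the markdown-parsing step is elided, and `decisions_touching` is a hypothetical helper; the GraphQL entry mirrors the example above):

```python
# Each decision-log entry as a small record mirroring the fields above.
DECISIONS = [
    {
        "decision": "Adopt GraphQL for the API layer",
        "why": "Enables flexible client queries and reduces over-fetching.",
        "reversible": True,
        "impacted_artifacts": ["api/schema.graphql", "docs/api.md"],
    },
]

def decisions_touching(artifact: str) -> list[dict]:
    """Answer 'why is this file the way it is?' from the log, not from memory."""
    return [d for d in DECISIONS if artifact in d["impacted_artifacts"]]

for d in decisions_touching("docs/api.md"):
    print(d["decision"], "-", d["why"])
```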
4. Checkpoints (anti‑C6)
Purpose: Frozen state.
Implementation: Git tags/releases that capture the repository at a known‑good moment (e.g., v0.3‑ontology‑stable).
How it helps: Provides a stable reference point for provenance.
5. Chronicle (anti‑C7)
Purpose: Continuity across fragmented threads.
Format: A narrative timeline that stitches conversations, decisions, and code changes into a coherent history.
How it helps: When work splits across sessions and models, the chronicle is the single thread that connects them.
The mapping was clean: every component targeted specific collapse risks, and every collapse risk had at least one countermeasure.
The Tension
The design tension hit when I compared the full Harness approach against a simpler one — Stage B, a GitHub‑based architecture with clear repository structure, front‑matter conventions, and manual discipline but no automation layer.
| Aspect | Harness | Stage B |
|---|---|---|
| Complexity | High – multiple YAML files, linter hooks, context‑pack generation scripts. | Low – conventional repo layout, naming conventions, manual updates. |
| Upkeep | Requires ongoing maintenance of ontology, context packs, and chronicle. | Relies on developer discipline; no extra tooling. |
| Risk of “meta‑work” | Real tasks can be slowed by scaffolding upkeep. | Faster iteration but higher chance of drift. |
| Protection against collapse risks | Explicit countermeasures for C1‑C7. | Implicit protection; many risks remain unaddressed. |
| Scalability | Designed to scale across sessions, models, and team members. | Works for small teams or short‑term projects. |
ChatGPT’s assessment was direct: the Harness brings complexity overhead and maintenance cost, while Stage B offers less friction but leaves the collapse risks largely unchecked.
Bottom Line
- Identify the seven collapse risks (C1‑C7).
- Choose a mitigation strategy that balances overhead with project longevity.
- If you need long‑term stability across many sessions and collaborators, invest in a Harness‑style scaffolding.
- If the project is short‑lived or you have a disciplined team, a lightweight Stage B approach may suffice.
Either way, being explicit about the risks is the first step toward keeping your LLM‑augmented project from silently falling apart.
The risk was building scaffolding that becomes the project: spending more time maintaining the immune system than doing the work the immune system was supposed to protect.
What It Revealed
The Harness design exposed a principle I kept coming back to: the scaffolding is a servo, not the engine. A servo keeps the system upright; the engine does the actual work. When the servo becomes the work — when you’re spending sessions on ontology maintenance and context‑pack generation instead of the research and writing the project exists to produce — you’ve built the wrong thing.
The deeper insight was that collapse isn’t a failure state; it’s the default state.
- LLMs don’t have memory—they have pattern regeneration.
- They don’t retrieve your terms; they reconstruct approximations.
- They don’t recall your decisions; they infer from whatever’s in the current window.
- They don’t maintain project state; they work in a perpetual present.
Everything outside the current context window is effectively gone unless you re‑inject it, and re‑injection is always lossy—summaries strip nuance, context packs compress rationale, session handoffs lose texture.
Once I understood collapse as the baseline rather than the exception, the design question changed. It wasn’t “how do I prevent collapse?”—you can’t, not fully, not with current architecture. It became “where do I need the scaffolding to hold, and where can I let it flex?”
- Vocabulary stability matters for a long‑running project—terms that drift create compounding ambiguity.
- Decision provenance matters when you’ll revisit choices months later.
- Session continuity matters less for one‑off analysis and more for iterative development across weeks.
The scaffolding spectrum that emerged
- Repository structure & naming conventions – cost nothing and prevent the most common failures.
- Ontology – add when vocabulary drift becomes noticeable.
- Context packs – add when session re‑entry starts taking more time than the work itself.
- Automation – add when manual discipline fails under load.
Don’t build the full Harness on day one. Build the piece that addresses the collapse risk you’re actually hitting.
The Reusable Rule
If you’re running a long‑term project through LLMs—anything that spans more than a few sessions—the model will lose track. Not because it’s broken, but because it doesn’t have memory. It operates via pattern regeneration inside a finite context window, and everything outside that window is gone.
Diagnostic cues
- Re‑explaining concepts the model already worked with → context saturation or vocabulary drift.
- Model suggests something you already rejected → evidence entropy.
- Uncertainty about the canonical version of a term → vocabulary drift compounding into reference ambiguity.
- Conversation quietly shifts from the original goal to something adjacent → goal creep.
Name the failure. The C1–C7 taxonomy isn’t sacred—rename, extend, or collapse it—but catalog what’s actually breaking so you can design against it specifically rather than generically.
Scaffolding principle
- Add components only when specific collapse risks appear, not before.
- Map every piece of infrastructure to the failure it prevents.
- If it doesn’t prevent a named failure, it’s probably meta‑work.
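The mapping rule lends itself to a mechanical audit. A sketch assuming each component declares which collapse risks it prevents; `weekly_status_poster` is a made-up example of unmapped meta-work, not something from the article:

```python
# Every scaffolding component must name the collapse risk it prevents;
# anything with no named failure is flagged as likely meta-work.
RISKS = {"C1", "C2", "C3", "C4", "C5", "C6", "C7"}

components = {
    "ontology.yml": {"C3"},
    "context_packs": {"C1", "C2"},
    "decision_log": {"C4", "C5"},
    "checkpoints": {"C6"},
    "chronicle": {"C7"},
    "weekly_status_poster": set(),  # prevents nothing named
}

def audit(components: dict) -> list[str]:
    findings = []
    for name, prevents in components.items():
        if not prevents:
            findings.append(f"{name}: no named failure -> likely meta-work")
        elif not prevents <= RISKS:
            findings.append(f"{name}: references unknown risks {prevents - RISKS}")
    uncovered = RISKS - set().union(*components.values())
    if uncovered:
        findings.append(f"uncovered risks: {sorted(uncovered)}")
    return findings

print(audit(components))
```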
And remember the servo principle: scaffolding exists to prevent collapse so you can do actual work. The moment it becomes the work, you’ve built the wrong thing.