The 2M Token Trap: Why 'Context Stuffing' Kills Reasoning
Source: Dev.to
Why more context often makes LLMs worse—and what to do instead
1. Introduction
The Context‑Window Arms Race
The expansion of context windows has been staggering:
- Early 2023 – GPT‑4 launches with 32 K tokens
- Nov 2023 – GPT‑4 Turbo extends to 128 K
- Feb 2024 – Gemini 1.5 hits 1 M (later 2 M)
- Mar 2024 – Claude 3 reaches 200 K
In just two years, capacity grew from 32 K to 2 M tokens—a 62× increase.
The developer intuition was immediate and seemingly logical:
“If everything fits, just put everything in.”
The Paradox: More Context, Worse Results
Practitioners are discovering a counter‑intuitive pattern:
The more context you provide, the worse the model performs.
Typical symptoms:
- Supplying an entire codebase → misunderstood design intent
- Including exhaustive logs → critical errors overlooked
- Providing comprehensive documentation → unfocused responses
This phenomenon appears in the research literature as “Lost in the Middle” (Liu et al., 2023). Information placed in the middle of long contexts is systematically neglected.
The uncomfortable truth is:
A context window is not just storage capacity; it is cognitive load.
This article explores why Context Stuffing fails, what Anthropic’s Claude Code reveals about effective context management, and how to shift from Prompt Engineering to Context Engineering—the discipline of architectural curation for AI systems.
2. Why “More Context” Doesn’t Mean “Better Understanding”
Capacity vs. Capability
- Capacity – How much data fits in memory (e.g., 200 K, 2 M tokens)
- Capability – The ability to prioritize, connect, and reason over that data
A model that can ingest 2 M tokens does not pay equal attention to all of them.
Providing a 2 M‑token context to an LLM is like handing a new developer 10 000 pages of documentation on day one and expecting them to fix a bug in five minutes—they will drown.
Attention Dilution and “Lost in the Middle”
The limitation stems from the self‑attention mechanism. As token count rises, attention distributions flatten, signal‑to‑noise ratios drop, and relevant information gets buried. Liu et al. (2023) showed that information in the middle of long contexts is systematically neglected, even when explicitly relevant, while content at the beginning and end receives disproportionate attention.
Context expansion increases what can be accessed, not what can be understood.
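A toy illustration of the dilution effect: as the haystack grows, the attention mass available to any single token shrinks, even before position effects come into play. The numbers below are illustrative toy values, not measurements from any real model.

```python
import numpy as np

def attention_weights(scores: np.ndarray) -> np.ndarray:
    """Softmax over raw attention scores for a single query."""
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

rng = np.random.default_rng(0)

for n_tokens in (1_000, 10_000, 100_000):
    # Random "noise" tokens plus one relevant token with a clearly higher score.
    scores = rng.normal(0.0, 1.0, n_tokens)
    scores[n_tokens // 2] += 4.0  # the needle, buried mid-context
    weights = attention_weights(scores)
    print(f"{n_tokens:>7} tokens -> needle weight {weights[n_tokens // 2]:.4f}")
```

The needle’s score never changes, yet its share of attention collapses as the context grows: the denominator of the softmax scales with the number of competing tokens.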
Real‑World Symptoms
- Entire codebases → architectural misinterpretation
- Exhaustive logs → critical signals buried
- Comprehensive docs → answers drift off‑topic
These are not failures of model intelligence; they are failures of information structure and prioritization—problems no amount of context capacity can solve.
3. The 75 % Rule: Lessons from Claude Code
The Problem – Quality Degradation in Long Sessions
Claude Code, Anthropic’s terminal‑based coding agent with a 200 K context window, exhibited:
- Degraded code quality over long sessions
- Forgotten earlier design decisions
- Auto‑compact failures causing infinite loops
At the time, Claude Code routinely used > 90 % of its available context.
The Solution – Auto‑Compact at 75 %
In September 2024, Anthropic introduced a counter‑intuitive fix:
Trigger auto‑compact when context usage reaches 75 %.
Result:
- ~150 K tokens used for storage
- ~50 K tokens deliberately left empty
What looked like waste turned out to be the key to dramatic quality improvements.
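A minimal sketch of the same idea in an agent loop. The `count_tokens` and `summarize` helpers are placeholders for whatever tokenizer and summarization call your stack provides; this is not Claude Code’s actual implementation.

```python
CONTEXT_WINDOW = 200_000          # total token budget
COMPACT_THRESHOLD = 0.75          # trigger compaction at 75 % usage

def maybe_compact(messages, count_tokens, summarize):
    """Compact the conversation once usage crosses the threshold.

    `count_tokens` and `summarize` are stand-ins for your own
    tokenizer and summarization call.
    """
    used = sum(count_tokens(m) for m in messages)
    if used < CONTEXT_WINDOW * COMPACT_THRESHOLD:
        return messages  # plenty of headroom, leave history untouched

    # Keep the most recent exchanges verbatim; fold everything older
    # into a single summary message so the model retains key decisions
    # without re-reading the full transcript.
    recent, older = messages[-10:], messages[:-10]
    summary = summarize(older)
    return [{"role": "system", "content": summary}, *recent]
```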
Why It Works – Inference Space
Hypotheses:
- Context Compression – Low‑relevance information is removed
- Information Restructuring – Summaries reorganize scattered data
- Preserving Room for Reasoning – Empty space enables generation
“That free context space isn’t wasted—it’s where reasoning happens.” – Developer
This mirrors computer memory behavior: running at 95 % RAM doesn’t mean the remaining 5 % is idle; it’s system overhead. Push to 100 %, and everything grinds to a halt.
Takeaway
- Filling context to capacity degrades output quality.
- Effective context management requires headroom—space reserved for reasoning, not just retrieval.
4. The Three Principles of Context Engineering
The era of prompt‑wording tweaks is ending. As Hamel Husain observed:
“AI Engineering is Context Engineering.”
The critical skill is no longer what you say to the model, but what you put in front of it—and what you deliberately leave out.
Principle 1: Isolation
Do not dump the monolith.
Borrow Bounded Contexts from Domain‑Driven Design. Provide the smallest effective context for the task.
Example – Add OAuth2 authentication
| Needed | Not Needed |
|---|---|
| User model | Billing module |
| SessionController | CSS styles |
| routes.rb | Unrelated APIs |
| Relevant auth middleware | Other test fixtures |
Ask: What is the minimum context required to solve this problem?
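One way to make that question executable is to assemble the context from an explicit allow-list rather than a directory dump. The file paths below are hypothetical, chosen to mirror the table above.

```python
from pathlib import Path

# Files judged relevant to the OAuth2 task: an explicit allow-list,
# not "everything under app/". These paths are illustrative only.
RELEVANT_FILES = [
    "app/models/user.rb",
    "app/controllers/session_controller.rb",
    "config/routes.rb",
    "app/middleware/auth.rb",
]

def build_context(repo_root: str) -> str:
    """Concatenate only the files inside the task's bounded context."""
    parts = []
    for rel_path in RELEVANT_FILES:
        path = Path(repo_root) / rel_path
        if path.exists():
            parts.append(f"# --- {rel_path} ---\n{path.read_text()}")
    return "\n\n".join(parts)
```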
Principle 2: Chaining
Pass artifacts, not histories.
Break workflows into stages:
- Plan → generate a concise plan (a few hundred tokens)
- Execute → run the plan using only the artifacts produced in step 1
- Reflect → summarize results and feed back a distilled view to the next iteration
Each stage receives only the previous stage’s output, not the entire conversation history.
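A sketch of that shape, assuming a generic `llm(prompt) -> str` callable rather than any particular SDK:

```python
def plan(task: str, llm) -> str:
    """Stage 1: produce a short plan; this is the only artifact passed on."""
    return llm(f"Write a concise, numbered implementation plan for: {task}")

def execute(plan_text: str, context: str, llm) -> str:
    """Stage 2: sees the plan and the isolated code context, not the chat history."""
    return llm(f"Follow this plan:\n{plan_text}\n\nRelevant code:\n{context}")

def reflect(diff: str, llm) -> str:
    """Stage 3: distill the result into a summary for the next iteration."""
    return llm(f"Summarize what changed and what remains open:\n{diff}")

# Each stage receives only the previous stage's artifact, never the full history:
# plan_text = plan("Add OAuth2 authentication", llm)
# diff      = execute(plan_text, isolated_context, llm)
# summary   = reflect(diff, llm)
```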
Principle 3: Reservation
Leave room for the model to think. Never run a model at 100 % of its window; adopt the 75 % Rule.
- Reserve ~25 % of the context window as “thinking space” for scratchpad reasoning, intermediate calculations, or on-the-fly summarization.
- Remember that token limits usually cover input + output: stuffing 195 K tokens into a 200 K window leaves almost no room for the response, let alone the reasoning behind it.
When the model runs out of headroom, it must truncate or compress, which often discards crucial reasoning steps.
Ask:
- Can this be decomposed into stages that pass summaries instead of transcripts?
- Have I left enough space for the model to think, not just respond?
Treat the context window as a scarce cognitive resource, not infinite storage.
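A back-of-the-envelope budget helper, as a sketch; the window size, reserve fraction, and expected output length are assumptions, not vendor constants.

```python
def input_budget(window: int = 200_000,
                 reserve_fraction: float = 0.25,
                 expected_output: int = 8_000) -> int:
    """How many input tokens can we spend and still leave room to think?

    Reserves a fraction of the window as thinking space and subtracts
    the output we expect the model to generate.
    """
    reserved = int(window * reserve_fraction)
    budget = window - reserved - expected_output
    if budget <= 0:
        raise ValueError("No input budget left; shrink the task or the output.")
    return budget

# 200_000 - 50_000 reserved - 8_000 output = 142_000 input tokens available
print(input_budget())
```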
5. Putting It All Together
- Audit your current prompts: identify monolithic dumps.
- Segment the problem into bounded contexts.
- Chain the workflow, feeding only the necessary artifacts forward.
- Reserve ~25 % of the context window for reasoning.
- Iterate: after each cycle, compact the context (summarize, prune) before the next pass.
By treating context as a design resource rather than a free storage bin, you’ll see:
- Higher fidelity to design intent
- Faster convergence on correct solutions
- More predictable, maintainable AI‑assisted workflows
6. Why 200 K Is the Sweet Spot
Cognitive Scale
150 K tokens (≈ 75 % of 200 K) is roughly one technical book—the largest coherent “project state” both humans and LLMs can manage. Beyond that, you need chapters, summaries, and architecture.
Cost and Latency
Self-attention scales at O(n²) in sequence length.
Doubling the context roughly quadruples the attention cost.
200 K balances performance, latency, and cost.
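A quick sanity check of that scaling, treating attention cost as proportional to n² (a simplification that ignores the linear terms):

```python
def relative_attention_cost(n_tokens: int, baseline: int = 200_000) -> float:
    """Attention cost relative to a 200 K-token baseline, assuming pure O(n^2)."""
    return (n_tokens / baseline) ** 2

for n in (200_000, 400_000, 1_000_000, 2_000_000):
    print(f"{n:>9} tokens -> {relative_attention_cost(n):.0f}x the attention cost")
```

A 2 M-token prompt is roughly 100× the attention cost of a 200 K one, before any quality considerations enter the picture.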
Methodological Discipline
200 K forces curation.
Exceeding it is a code smell: unclear boundaries, oversized tasks, or stuffing instead of structuring.
Anthropic offers 1 M tokens—but behind premium tiers.
The implicit message: 1 M is for special cases. 200 K is the default for a reason.
The constraint is not a limitation—it is the design principle.
7. Conclusion: From Prompt Engineering to Context Engineering
The context‑window arms race delivered a 62× increase in capacity, but capacity was never the bottleneck.
The bottleneck is—and always has been—curation.
| Prompt Engineering | Context Engineering |
|---|---|
| “How do I phrase this?” | “What should the model see?” |
| Optimizing words | Architecting information |
| Single‑shot prompts | Multi‑stage pipelines |
| Filling capacity | Preserving headroom |
Three Questions to Ask Before Every Task
- Am I stuffing context just because I can? Relevant beats exhaustive.
- Is this context isolated to the real problem? If you can’t state the boundary, you haven’t found it.
- Have I left room for the model to think? Output quality requires input restraint.
The era of prompt engineering rewarded clever wording.
The era of context engineering rewards architectural judgment.
The question is no longer: What should I say to the model?
The question is: What world should the model see?
8. References
Research Papers
- Liu et al., Lost in the Middle: How Language Models Use Long Contexts (2023)
Tools & Methodologies
- planstack.ai
Empirical Studies
- Greg Kamradt, Needle in a Haystack
