Stop Feeding 'Junk' Tokens to Your LLM. (I Built a Proxy to Fix It)
Source: Dev.to – “Stop feeding junk tokens to your LLM – I built a proxy to fix it”
Overview
I recently built an agent to handle some SRE tasks—fetching logs, querying databases, searching code. It worked, but when I looked at the traces I was annoyed.
It wasn’t just that it was expensive (the bill was climbing). It was the sheer inefficiency.
- A single tool output—a search for Python files—was 40 000 tokens.
- About 35 000 tokens were just "type": "file" and "language": "python" repeated 2 000 times.
We were paying premium compute prices to force state‑of‑the‑art models to read standard JSON boilerplate.
I couldn’t find a tool that solved this without breaking the agent, so I wrote one. It’s called Headroom. It sits between your app and your LLM and compresses context by ~85 % without losing semantic meaning.
Open‑source – Apache‑2.0
Code: (link to repository)
Why Truncation and Summarization Don’t Work
When the context window fills up, the industry standard is truncation (chopping off the oldest messages or the middle of the document). For an agent, truncation is dangerous:
- Log files: Cutting the middle of a log may discard the single error line that explains a crash.
- File lists: Removing entries can hide the exact configuration file the user requested.
I tried summarization (using a cheaper model to summarize the data first), but that introduced hallucination. A summarizer told me a deployment “looked fine” because it ignored specific error codes in the raw log.
What we need: a third option—lossless compression, or at least “intent‑lossless”.
The Core Idea: Statistical Analysis, Not Blind Truncation
I realized that ~90 % of the data in a tool output is just schema scaffolding. The LLM doesn’t need to see status: active repeated a thousand times; it needs the anomalies.
Headroom’s SmartCrusher runs statistical analysis before touching your data:
- Constant Factoring – If every item in an array has "type": "file", the constant is extracted once instead of being repeated.
- Outlier Detection – Calculates the standard deviation of numeric fields and preserves spikes (> 2σ from the mean). Those spikes are usually what matters.
- Error Preservation – Hard rule: never discard strings that look like stack traces, error messages, or failures. Errors are sacred.
- Relevance Scoring – Items matching the user’s query are kept, using a hybrid BM25 + semantic‑embedding score.
- First/Last Retention – Always keeps the first few and last few items; the LLM expects some examples, and recency matters.
Result:
40 000 tokens → ~4 000 tokens. The important information survives intact, with no hallucination risk.
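To make those five rules concrete, here is a minimal sketch in plain Python. It is not Headroom's code: the crush function, its thresholds, and the output shape are assumptions based purely on the description above.

```python
import statistics

def crush(items, keep_edges=3, sigma=2.0):
    """Illustrative only: factor out constants, keep outliers, errors, and edge items."""
    keys = set().union(*(item.keys() for item in items))

    # 1. Constant factoring: fields whose value never varies are stated once.
    constants = {k: items[0].get(k) for k in keys
                 if all(item.get(k) == items[0].get(k) for item in items)}

    # 2. Outlier detection: keep numeric values more than `sigma` std devs from the mean.
    outliers = set()
    for k in keys - constants.keys():
        values = [item.get(k) for item in items]
        if not all(isinstance(v, (int, float)) for v in values):
            continue
        mean, stdev = statistics.mean(values), statistics.pstdev(values)
        outliers |= {i for i, v in enumerate(values) if stdev and abs(v - mean) > sigma * stdev}

    # 3. Error preservation: never drop items that look like failures.
    errors = {i for i, item in enumerate(items)
              if any("error" in str(v).lower() or "traceback" in str(v).lower()
                     for v in item.values())}

    # 4. First/last retention: the model expects a few examples, and recency matters.
    edges = set(range(min(keep_edges, len(items)))) | set(range(max(len(items) - keep_edges, 0), len(items)))

    keep = sorted(outliers | errors | edges)
    return {
        "constants": constants,                 # stated once instead of N times
        "items": [items[i] for i in keep],
        "omitted": len(items) - len(keep),      # tells the LLM that data was elided
    }
```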
CCR: Making Compression Reversible
Key insight: compression should be reversible.
The architecture is called CCR (Compress‑Cache‑Retrieve).
| Step | Description |
|---|---|
| 1. Compress | SmartCrusher compresses the tool output (e.g., 2 000 items → 20). |
| 2. Cache | The original 2 000 items are cached locally (5‑minute TTL, LRU eviction). |
| 3. Retrieve | Headroom injects a headroom_retrieve() tool into the LLM’s context. If the model needs more data after reading the summary, it calls that tool and Headroom returns the required items from the cache. |
Why it matters
- You can compress aggressively (90 %+ reduction) because nothing is ever truly lost—the model can always “unzip” what it needs.
- The risk calculus shifts: the LLM can request the full data on‑demand, eliminating the “information‑loss” problem.
Example conversation
Turn 1: "Search for all Python files"
→ 1 000 files returned, compressed to 15 items
Turn 5: "Actually, what was that file handling JWT tokens?"
→ LLM calls headroom_retrieve("jwt")
→ Returns jwt_handler.py from cached data
- No extra API calls.
- No “sorry, I don’t have that information anymore.”
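For illustration, the cache half of CCR could look roughly like the sketch below, assuming the 5‑minute TTL and LRU eviction described in the table. ToolOutputCache and its methods are hypothetical names, not Headroom's actual interface, and the real headroom_retrieve() tool presumably does smarter matching than a substring filter.

```python
import time
from collections import OrderedDict

class ToolOutputCache:
    """Illustrative CCR cache: originals stay available so compression is reversible."""

    def __init__(self, max_entries=100, ttl_seconds=300):    # 5-minute TTL
        self._entries = OrderedDict()
        self._max = max_entries
        self._ttl = ttl_seconds

    def put(self, call_id, items):
        """Store the uncompressed tool output under its tool-call id."""
        self._entries[call_id] = (time.time(), items)
        self._entries.move_to_end(call_id)
        if len(self._entries) > self._max:                    # LRU eviction
            self._entries.popitem(last=False)

    def retrieve(self, call_id, query):
        """What a headroom_retrieve()-style tool could do: search the cached originals."""
        entry = self._entries.get(call_id)
        if entry is None or time.time() - entry[0] > self._ttl:
            return []                                         # expired or never cached
        self._entries.move_to_end(call_id)                    # mark as recently used
        _, items = entry
        return [item for item in items if query.lower() in str(item).lower()]
```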
TOIN: The Network Effect
Headroom learns from compression patterns via TOIN (Tool Output Intelligence Network). It anonymously tracks what happens after compression:
- Which fields are retrieved most often?
- Which tool types have high retrieval rates?
- What query patterns trigger retrievals?
This data feeds back into compression recommendations. For example, if TOIN learns that users frequently retrieve the error_code field after compression, it tells SmartCrusher to preserve error_code more aggressively the next time.
Privacy built‑in
| Aspect | Implementation |
|---|---|
| Data values | Never stored |
| Tool identifiers | Stored as structure hashes |
| Field names | Stored as SHA‑256[:8] hashes |
| User tracking | No user identifiers |
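As a rough illustration of the scheme in the table, a reported compression event might carry only hashes and ratios. The payload shape below is an assumption; only the SHA‑256[:8] detail comes from the table.

```python
import hashlib

def anonymize(name: str) -> str:
    """SHA-256 truncated to 8 hex chars, as in the table above."""
    return hashlib.sha256(name.encode()).hexdigest()[:8]

# A hypothetical TOIN event: hashes and ratios only, no raw values, no user identifiers.
event = {
    "fields_retrieved": [anonymize("error_code"), anonymize("status")],
    "compression_ratio": 0.12,
}
```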
The network effect
More users → more compression events → better recommendations for everyone.
Memory: Cross‑Conversation Learning
Agents often need to remember facts across conversations (e.g., “I prefer dark mode”, “My timezone is PST”, “I’m working on the auth refactor”). Headroom provides a memory system that extracts and stores these facts automatically.
Fast Memory (recommended)
- Zero extra latency – the LLM outputs a memory block inline with its response.
- Headroom parses the block and stores the memory for future requests.
from openai import OpenAI
from headroom.memory import with_fast_memory

client = with_fast_memory(OpenAI(), user_id="alice")
# Memories are extracted automatically from responses
# and injected into future requests
Background Memory
- A separate LLM call extracts memories asynchronously.
- More accurate, but adds a small amount of latency.
from openai import OpenAI
from headroom import with_memory

client = with_memory(OpenAI(), user_id="alice")
Both approaches store memories locally (SQLite) and inject them into subsequent conversations, allowing the model to “remember” without any external service.
TL;DR
- Headroom compresses tool output by ~85 % using statistical analysis, while preserving anomalies, errors, and relevance.
- CCR (Compress‑Cache‑Retrieve) makes compression reversible, allowing the LLM to fetch the raw data on demand.
- TOIN learns from collective usage to improve future compression.
- Built‑in memory lets agents retain cross‑conversation facts with zero or minimal latency.
Give it a try, and let the context window work for you, not against you.
The Transform Pipeline
Headroom runs four transforms on each request:
1. CacheAligner
LLM providers offer cached‑token pricing (Anthropic: 90 % off, OpenAI: 50 % off). Caching only works if your prompt prefix is stable.
Problem – Your system prompt probably contains a timestamp, e.g.:
Current time: 2024-01-15 10:32:45
That breaks caching.
Solution – CacheAligner extracts dynamic content and moves it to the end, stabilising the prefix. The same information is retained, but cache hits improve.
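As a sketch of the idea (not CacheAligner's actual implementation), the transform could strip the volatile line from the system prompt and re-attach it at the end of the message list, assuming OpenAI-style message dicts with the system prompt first:

```python
import re

TIMESTAMP = re.compile(r"Current time: [^\n]*")

def align_for_caching(messages):
    """Illustrative sketch: keep the system prompt byte-stable so provider-side
    prompt caching can hit; volatile details are re-attached at the end."""
    system, rest = messages[0], messages[1:]
    match = TIMESTAMP.search(system["content"])
    if match is None:
        return messages                                    # nothing dynamic to move
    stable = TIMESTAMP.sub("", system["content"]).strip()  # stable, cacheable prefix
    return ([{"role": "system", "content": stable}]
            + rest
            + [{"role": "user", "content": match.group(0)}])  # dynamic content at the end
```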
2. SmartCrusher
The statistical compression engine. It analyses arrays, detects patterns, preserves anomalies, and factors constants.
3. ContentRouter
Different content needs different compression. Code isn’t JSON, isn’t logs, isn’t prose.
ContentRouter uses ML‑based content detection to route data to specialised compressors:
| Content Type | Compressor |
|---|---|
| Code | AST‑aware compression (tree‑sitter) |
| JSON | SmartCrusher |
| Logs | LogCompressor (clusters similar messages) |
| Text | Optional LLMLingua integration (≈20× compression, adds latency) |
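The routing itself is simple once the content type is known. The sketch below uses a crude heuristic detector as a stand-in for the ML-based classifier, and the compressor names are placeholders:

```python
import json

def detect_content_type(text: str) -> str:
    """Crude heuristic stand-in for ML-based detection, just to show the routing shape."""
    try:
        json.loads(text)
        return "json"
    except ValueError:
        pass
    if any(level in text for level in ("ERROR", "WARN", "INFO", "DEBUG")):
        return "logs"
    if "def " in text or "class " in text or "};" in text:
        return "code"
    return "text"

def route(text, compressors):
    """compressors maps a content type to a callable, e.g. {"json": smart_crusher, ...}."""
    return compressors.get(detect_content_type(text), compressors["text"])(text)
```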
4. RollingWindow
When context exceeds the model limit, something has to go. RollingWindow drops the oldest tool calls + responses together (never orphaning data), while preserving the system prompt and recent turns.
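A minimal sketch of that eviction rule, assuming OpenAI-style messages where tool results reference the assistant tool call that produced them (the pairing logic here is an assumption, not Headroom's code):

```python
def roll_window(messages, count_tokens, limit):
    """Illustrative sketch: drop the oldest tool call and its result as a pair,
    keep the system prompt and recent turns, until the context fits."""
    system, history = messages[0], list(messages[1:])
    while count_tokens([system] + history) > limit:
        # oldest assistant message that issued tool calls
        idx = next((i for i, m in enumerate(history)
                    if m["role"] == "assistant" and m.get("tool_calls")), None)
        if idx is None:
            break                                   # nothing safe left to drop
        call_ids = {c["id"] for c in history[idx]["tool_calls"]}
        history = [m for i, m in enumerate(history)
                   if i != idx and not (m.get("role") == "tool"
                                        and m.get("tool_call_id") in call_ids)]
    return [system] + history
```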
Three Ways to Use It
Option 1: Proxy Server (Zero Code Changes)
pip install headroom-ai
headroom proxy --port 8787
Point your OpenAI client to http://localhost:8787/v1. Done.
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8787/v1")
# No other changes
Works with Claude Code, Cursor, any OpenAI‑compatible client.
Option 2: SDK Wrapper
from headroom import HeadroomClient
from openai import OpenAI
client = HeadroomClient(OpenAI())
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    headroom_mode="optimize"  # or "audit" or "simulate"
)
Modes
| Mode | Description |
|---|---|
| audit | Observe only. Logs what would be optimised; does not change anything. |
| optimize | Apply compression – this is what saves tokens. |
| simulate | Dry run. Returns the optimised messages without calling the API. |
Start with audit to see potential savings, then flip to optimize when you’re confident.
Option 3: Framework Integrations
LangChain
from langchain_openai import ChatOpenAI
from headroom.integrations.langchain import HeadroomChatModel
base_model = ChatOpenAI(model="gpt-4o")
model = HeadroomChatModel(base_model, mode="optimize")
# Use in any chain or agent
chain = prompt | model | parser
Agno
from agno.agent import Agent
from headroom.integrations.agno import HeadroomAgnoModel
model = HeadroomAgnoModel(original_model, mode="optimize")
agent = Agent(model=model, tools=[...])
MCP (Model Context Protocol)
from headroom.integrations.mcp import compress_tool_result
# Compress any tool result before returning to LLM
compressed = compress_tool_result(tool_name, result_data)
Real Numbers
| Workload | Before (tokens) | After (tokens) | Savings |
|---|---|---|---|
| Log Analysis | 22 000 | 3 300 | 85 % |
| Code Search | 45 000 | 4 500 | 90 % |
| Database Queries | 18 000 | 2 700 | 85 % |
| Long Conversations | 80 000 | 32 000 | 60 % |
What’s Coming Next
More Frameworks
- CrewAI integration
- AutoGen integration
- Semantic Kernel integration
Managed Storage
- Cloud‑hosted TOIN backend (opt‑in)
- Cross‑device memory sync
- Team‑shared compression patterns
Better Compression
- Domain‑specific profiles (SRE, coding, data analysis)
- Custom compressor plugins
- Streaming compression for real‑time tools
Why I Built This
I believe we’re in the “optimization phase” of the AI hype cycle. Getting things to work is table stakes; getting them to work cheaply and reliably is the real engineering work.
Headroom tackles the “context bloat” problem with statistical analysis and reversible compression—not with heuristics or blunt truncation. It runs entirely locally, so no data leaves your machine (aside from the usual OpenAI/Anthropic calls).
- License: Apache‑2.0
- Repo: GitHub – headroom (link to repository)
If you find bugs or have ideas, please open an issue. I’m actively maintaining this project.