Stop Feeding 'Junk' Tokens to Your LLM. (I Built a Proxy to Fix It)
Source: Dev.to – “Stop feeding junk tokens to your LLM – I built a proxy to fix it”
Overview
I recently built an agent to handle some SRE tasks—fetching logs, querying databases, searching code. It worked, but when I looked at the traces I was annoyed.
It wasn’t just that it was expensive (the bill was climbing). It was the sheer inefficiency.
- A single tool output—a search for Python files—was 40 000 tokens.
- About 35 000 tokens were just "type": "file" and "language": "python" repeated 2 000 times.
We were paying premium compute prices to force state‑of‑the‑art models to read standard JSON boilerplate.
I couldn’t find a tool that solved this without breaking the agent, so I wrote one. It’s called Headroom. It sits between your app and your LLM and compresses context by ~85 % without losing semantic meaning.
Open‑source – Apache‑2.0
Code: (link to repository)
Why Truncation and Summarization Don’t Work
When the context window fills up, the industry standard is truncation (chopping off the oldest messages or the middle of the document). For an agent, truncation is dangerous:
- Log files: Cutting the middle of a log may discard the single error line that explains a crash.
- File lists: Removing entries can hide the exact configuration file the user requested.
I tried summarization (using a cheaper model to summarize the data first), but that introduced hallucination. A summarizer told me a deployment “looked fine” because it ignored specific error codes in the raw log.
What we need: a third option—lossless compression, or at least “intent‑lossless”.
The Core Idea: Statistical Analysis, Not Blind Truncation
I realized that ~90 % of the data in a tool output is just schema scaffolding. The LLM doesn’t need to see status: active repeated a thousand times; it needs the anomalies.
Headroom’s SmartCrusher runs statistical analysis before touching your data:
- Constant Factoring – If every item in an array has "type": "file", the constant is extracted once instead of being repeated.
- Outlier Detection – Calculates the standard deviation of numeric fields and preserves spikes (> 2σ from the mean). Those spikes are usually what matters.
- Error Preservation – Hard rule: never discard strings that look like stack traces, error messages, or failures. Errors are sacred.
- Relevance Scoring – Items matching the user’s query are kept, using a hybrid BM25 + semantic‑embedding score.
- First/Last Retention – Always keeps the first few and last few items; the LLM expects some examples, and recency matters.
Result:
40 000 tokens → ~4 000 tokens. The important information survives intact, with no hallucination risk.
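To make those five rules concrete, here is a minimal sketch in plain Python. It is not Headroom's code: the crush function, its thresholds, and the output shape are assumptions based purely on the description above.

```python
import statistics

def crush(items, keep_edges=3, sigma=2.0):
    """Illustrative only: factor out constants, keep outliers, errors, and edge items."""
    keys = set().union(*(item.keys() for item in items))

    # 1. Constant factoring: fields whose value never varies are stated once.
    constants = {k: items[0].get(k) for k in keys
                 if all(item.get(k) == items[0].get(k) for item in items)}

    # 2. Outlier detection: keep numeric values more than `sigma` std devs from the mean.
    outliers = set()
    for k in keys - constants.keys():
        values = [item.get(k) for item in items]
        if not all(isinstance(v, (int, float)) for v in values):
            continue
        mean, stdev = statistics.mean(values), statistics.pstdev(values)
        outliers |= {i for i, v in enumerate(values) if stdev and abs(v - mean) > sigma * stdev}

    # 3. Error preservation: never drop items that look like failures.
    errors = {i for i, item in enumerate(items)
              if any("error" in str(v).lower() or "traceback" in str(v).lower()
                     for v in item.values())}

    # 4. First/last retention: the model expects a few examples, and recency matters.
    edges = set(range(min(keep_edges, len(items)))) | set(range(max(len(items) - keep_edges, 0), len(items)))

    keep = sorted(outliers | errors | edges)
    return {
        "constants": constants,                 # stated once instead of N times
        "items": [items[i] for i in keep],
        "omitted": len(items) - len(keep),      # tells the LLM that data was elided
    }
```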
CCR: Making Compression Reversible
Key insight: compression should be reversible.
The architecture is called CCR (Compress‑Cache‑Retrieve).
| Step | Description |
|---|---|
| 1. Compress | SmartCrusher compresses the tool output (e.g., 2 000 items → 20). |
| 2. Cache | The original 2 000 items are cached locally (5‑minute TTL, LRU eviction). |
| 3. Retrieve | Headroom injects a headroom_retrieve() tool into the LLM’s context. If the model needs more data after reading the summary, it calls that tool and Headroom returns the required items from the cache. |
Why it matters
- You can compress aggressively (90 %+ reduction) because nothing is ever truly lost—the model can always “unzip” what it needs.
- The risk calculus shifts: the LLM can request the full data on‑demand, eliminating the “information‑loss” problem.
Example conversation
Turn 1: "Search for all Python files"
→ 1 000 files returned, compressed to 15 items
Turn 5: "Actually, what was that file handling JWT tokens?"
→ LLM calls headroom_retrieve("jwt")
→ Returns jwt_handler.py from cached data
- No extra API calls.
- No “sorry, I don’t have that information anymore.”
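For illustration, the cache half of CCR could look roughly like the sketch below, assuming the 5‑minute TTL and LRU eviction described in the table. ToolOutputCache and its methods are hypothetical names, not Headroom's actual interface, and the real headroom_retrieve() tool presumably does smarter matching than a substring filter.

```python
import time
from collections import OrderedDict

class ToolOutputCache:
    """Illustrative CCR cache: originals stay available so compression is reversible."""

    def __init__(self, max_entries=100, ttl_seconds=300):    # 5-minute TTL
        self._entries = OrderedDict()
        self._max = max_entries
        self._ttl = ttl_seconds

    def put(self, call_id, items):
        """Store the uncompressed tool output under its tool-call id."""
        self._entries[call_id] = (time.time(), items)
        self._entries.move_to_end(call_id)
        if len(self._entries) > self._max:                    # LRU eviction
            self._entries.popitem(last=False)

    def retrieve(self, call_id, query):
        """What a headroom_retrieve()-style tool could do: search the cached originals."""
        entry = self._entries.get(call_id)
        if entry is None or time.time() - entry[0] > self._ttl:
            return []                                         # expired or never cached
        self._entries.move_to_end(call_id)                    # mark as recently used
        _, items = entry
        return [item for item in items if query.lower() in str(item).lower()]
```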
TOIN: The Network Effect
Headroom learns from compression patterns via TOIN (Tool Output Intelligence Network). It anonymously tracks what happens after compression:
- Which fields are retrieved most often?
- Which tool types have high retrieval rates?
- What query patterns trigger retrievals?
This data feeds back into compression recommendations. For example, if TOIN learns that users frequently retrieve the error_code field after compression, it tells SmartCrusher to preserve error_code more aggressively the next time.
Privacy built‑in
| Aspect | Implementation |
|---|---|
| Data values | Never stored |
| Tool identifiers | Stored as structure hashes |
| Field names | Stored as SHA‑256[:8] hashes |
| User tracking | No user identifiers |
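As a rough illustration of the scheme in the table, a reported compression event might carry only hashes and ratios. The payload shape below is an assumption; only the SHA‑256[:8] detail comes from the table.

```python
import hashlib

def anonymize(name: str) -> str:
    """SHA-256 truncated to 8 hex chars, as in the table above."""
    return hashlib.sha256(name.encode()).hexdigest()[:8]

# A hypothetical TOIN event: hashes and ratios only, no raw values, no user identifiers.
event = {
    "fields_retrieved": [anonymize("error_code"), anonymize("status")],
    "compression_ratio": 0.12,
}
```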
The network effect
More users → more compression events → better recommendations for everyone.
Memory: Cross‑Conversation Learning
Agents often need to remember facts across conversations (e.g., “I prefer dark mode”, “My timezone is PST”, “I’m working on the auth refactor”). Headroom provides a memory system that extracts and stores these facts automatically.
Fast Memory (recommended)
- Zero extra latency – the LLM outputs a memory block inline with its response.
- Headroom parses the block and stores the memory for future requests.
from openai import OpenAI
from headroom.memory import with_fast_memory

client = with_fast_memory(OpenAI(), user_id="alice")
# Memories are extracted automatically from responses
# and injected into future requests
Background Memory
- A separate LLM call extracts memories asynchronously.
- More accurate, but adds a small amount of latency.
from openai import OpenAI
from headroom import with_memory

client = with_memory(OpenAI(), user_id="alice")
Both approaches store memories locally (SQLite) and inject them into subsequent conversations, allowing the model to “remember” without any external service.
TL;DR
- Headroom compresses tool output by ~85 % using statistical analysis, while preserving anomalies, errors, and relevance.
- CCR (Compress‑Cache‑Retrieve) makes compression reversible, allowing the LLM to fetch the raw data on demand.
- TOIN learns from collective usage to improve future compression.
- Built‑in memory lets agents retain cross‑conversation facts with zero or minimal latency.
Give it a try, and let the context window work for you, not against you.
The Transform Pipeline
Headroom runs four transforms on each request:
1. CacheAligner
LLM providers offer cached‑token pricing (Anthropic: 90 % off, OpenAI: 50 % off). Caching only works if your prompt prefix is stable.
Problem – Your system prompt probably contains a timestamp, e.g.:
Current time: 2024-01-15 10:32:45
That breaks caching.
Solution – CacheAligner extracts dynamic content and moves it to the end, stabilising the prefix. The same information is retained, but cache hits improve.
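As a sketch of the idea (not CacheAligner's actual implementation), the transform could strip the volatile line from the system prompt and re-attach it at the end of the message list, assuming OpenAI-style message dicts with the system prompt first:

```python
import re

TIMESTAMP = re.compile(r"Current time: [^\n]*")

def align_for_caching(messages):
    """Illustrative sketch: keep the system prompt byte-stable so provider-side
    prompt caching can hit; volatile details are re-attached at the end."""
    system, rest = messages[0], messages[1:]
    match = TIMESTAMP.search(system["content"])
    if match is None:
        return messages                                    # nothing dynamic to move
    stable = TIMESTAMP.sub("", system["content"]).strip()  # stable, cacheable prefix
    return ([{"role": "system", "content": stable}]
            + rest
            + [{"role": "user", "content": match.group(0)}])  # dynamic content at the end
```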
2. SmartCrusher
The statistical compression engine. It analyses arrays, detects patterns, preserves anomalies, and factors constants.
3. ContentRouter
Different content needs different compression. Code isn’t JSON, isn’t logs, isn’t prose.
ContentRouter uses ML‑based content detection to route data to specialised compressors:
| Content Type | Compressor |
|---|---|
| Code | AST‑aware compression (tree‑sitter) |
| JSON | SmartCrusher |
| Logs | LogCompressor (clusters similar messages) |
| Text | Optional LLMLingua integration (≈20× compression, adds latency) |
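The routing itself is simple once the content type is known. The sketch below uses a crude heuristic detector as a stand-in for the ML-based classifier, and the compressor names are placeholders:

```python
import json

def detect_content_type(text: str) -> str:
    """Crude heuristic stand-in for ML-based detection, just to show the routing shape."""
    try:
        json.loads(text)
        return "json"
    except ValueError:
        pass
    if any(level in text for level in ("ERROR", "WARN", "INFO", "DEBUG")):
        return "logs"
    if "def " in text or "class " in text or "};" in text:
        return "code"
    return "text"

def route(text, compressors):
    """compressors maps a content type to a callable, e.g. {"json": smart_crusher, ...}."""
    return compressors.get(detect_content_type(text), compressors["text"])(text)
```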
4. RollingWindow
When context exceeds the model limit, something has to go. RollingWindow drops the oldest tool calls + responses together (never orphaning data), while preserving the system prompt and recent turns.
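A minimal sketch of that eviction rule, assuming OpenAI-style messages where tool results reference the assistant tool call that produced them (the pairing logic here is an assumption, not Headroom's code):

```python
def roll_window(messages, count_tokens, limit):
    """Illustrative sketch: drop the oldest tool call and its result as a pair,
    keep the system prompt and recent turns, until the context fits."""
    system, history = messages[0], list(messages[1:])
    while count_tokens([system] + history) > limit:
        # oldest assistant message that issued tool calls
        idx = next((i for i, m in enumerate(history)
                    if m["role"] == "assistant" and m.get("tool_calls")), None)
        if idx is None:
            break                                   # nothing safe left to drop
        call_ids = {c["id"] for c in history[idx]["tool_calls"]}
        history = [m for i, m in enumerate(history)
                   if i != idx and not (m.get("role") == "tool"
                                        and m.get("tool_call_id") in call_ids)]
    return [system] + history
```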
Three Ways to Use It
Option 1: Proxy Server (Zero Code Changes)
pip install headroom-ai
headroom proxy --port 8787
Point your OpenAI client to http://localhost:8787/v1. Done.
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8787/v1")
# No other changes
Works with Claude Code, Cursor, any OpenAI‑compatible client.
Option 2: SDK Wrapper
from headroom import HeadroomClient
from openai import OpenAI
client = HeadroomClient(OpenAI())
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    headroom_mode="optimize"  # or "audit" or "simulate"
)
Modes
| Mode | Description |
|---|---|
| audit | Observe only. Logs what would be optimised; does not change anything. |
| optimize | Apply compression – this is what saves tokens. |
| simulate | Dry run. Returns the optimised messages without calling the API. |
Start with audit to see potential savings, then flip to optimize when you’re confident.
Option 3: Framework Integrations
LangChain
from langchain_openai import ChatOpenAI
from headroom.integrations.langchain import HeadroomChatModel
base_model = ChatOpenAI(model="gpt-4o")
model = HeadroomChatModel(base_model, mode="optimize")
# Use in any chain or agent
chain = prompt | model | parser
Agno
from agno.agent import Agent
from headroom.integrations.agno import HeadroomAgnoModel
model = HeadroomAgnoModel(original_model, mode="optimize")
agent = Agent(model=model, tools=[...])
MCP (Model Context Protocol)
from headroom.integrations.mcp import compress_tool_result
# Compress any tool result before returning to LLM
compressed = compress_tool_result(tool_name, result_data)
Real Numbers
| Workload | Before (tokens) | After (tokens) | Savings |
|---|---|---|---|
| Log Analysis | 22 000 | 3 300 | 85 % |
| Code Search | 45 000 | 4 500 | 90 % |
| Database Queries | 18 000 | 2 700 | 85 % |
| Long Conversations | 80 000 | 32 000 | 60 % |
What’s Coming Next
More Frameworks
- CrewAI integration
- AutoGen integration
- Semantic Kernel integration
Managed Storage
- Cloud‑hosted TOIN backend (opt‑in)
- Cross‑device memory sync
- Team‑shared compression patterns
Better Compression
- Domain‑specific profiles (SRE, coding, data analysis)
- Custom compressor plugins
- Streaming compression for real‑time tools
Why I Built This
I believe we’re in the “optimization phase” of the AI hype cycle. Getting things to work is table stakes; getting them to work cheaply and reliably is the real engineering work.
Headroom tackles the “context bloat” problem with statistical analysis and reversible compression—not with heuristics or blunt truncation. It runs entirely locally, so no data leaves your machine (aside from the usual OpenAI/Anthropic calls).
- License: Apache‑2.0
- Repo: GitHub – headroom (link to repository)
If you find bugs or have ideas, please open an issue. I’m actively maintaining this project.