GenAI Systems Lab
AI Engineering

Context Compression: Why Dumping Raw Tool Outputs Into the LLM Burns Tokens

RAG chunks, tool outputs, and logs piped raw into the context window raise cost linearly with token count, bury the signal in the middle, and crowd out everything else the window needs to hold. Headroom (open-source, Apache-2.0) compresses agent context by content type, routing JSON, code (AST-aware), and prose to separate compressors. It claims the same answers at 60–95% fewer tokens, with reversible CCR (Compress-Cache-Retrieve) so the model can fetch originals on demand. This is an honest teardown of a third-party tool, including where lossy compression on precise data breaks.

Your agent calls a tool, gets back 65,000 tokens of JSON, and pipes all of it into the next LLM call to answer a question that needed 5,000. Multiply that by every tool call, every retrieved chunk, every log file in the loop — and you are paying, in money and in latency, to hand the model a haystack and ask it to find a needle it already described to you.

This is the context-engineering problem almost nobody budgets for. Teams pipe the raw firehose (RAG chunks, tool outputs, stack traces, file contents) straight into the window. That inflates cost linearly with token count, pushes the real signal toward the middle of the context, where models attend to it worst, and eats the space that conversation history needs.
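To put a rough number on "linearly", here is a back-of-the-envelope sketch using the 65,000-token and 5,000-token figures from the opening example. The price per million input tokens and the call count are hypothetical placeholders, not figures from any provider or from Headroom; substitute your own.

```python
# Hypothetical rate: replace with your provider's actual input-token price.
PRICE_PER_MILLION_INPUT_TOKENS_USD = 3.00


def input_cost_usd(tokens_per_call: int, calls: int = 1) -> float:
    """Input cost scales linearly with tokens sent and with the number of calls."""
    return tokens_per_call * calls * PRICE_PER_MILLION_INPUT_TOKENS_USD / 1_000_000


# Assumed workload: 1,000 tool calls, each followed by one LLM call.
raw_cost = input_cost_usd(65_000, calls=1_000)     # raw tool output piped in
needed_cost = input_cost_usd(5_000, calls=1_000)   # only what the question needed

print(f"raw: ${raw_cost:.2f}")
print(f"needed: ${needed_cost:.2f}")
print(f"overspend factor: {raw_cost / needed_cost:.0f}x")
```

Whatever the actual price, the overspend factor is just the token ratio, which is why trimming input tokens translates directly into savings.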

Headroom is an open-source (Apache-2.0) context-compression layer for AI agents from Tejas Chopra. It sits between your agent and the model and compresses everything the LLM reads (tool outputs, logs, RAG chunks, files, history) before it arrives, claiming 60–95% fewer tokens with the same answers. It is a third-party tool, not an Anthropic product. The engineering idea behind it is what's worth understanding cold.

Why raw context burns tokens

How Headroom compresses — type-aware, not blind

The core design decision is that you do not compress everything the same way. A router detects the content type and sends each blob to a compressor built for it.
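The routing idea can be sketched in a few lines. This is a minimal illustration of type-aware routing in general, not Headroom's implementation: `detect_type`, `route`, and the three toy compressors are hypothetical names, the detection heuristic is deliberately crude, and the code branch only understands Python source (it needs Python 3.9+ for `ast.unparse`).

```python
import ast
import json
from typing import Callable

# Top-level node types treated as evidence that a blob is Python source.
_CODE_NODES = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef,
               ast.Import, ast.ImportFrom, ast.Assign)


def detect_type(blob: str) -> str:
    """Classify a blob as 'json', 'code', or 'prose' (crude, hypothetical heuristic)."""
    stripped = blob.strip()
    if stripped[:1] in ("{", "["):
        try:
            json.loads(stripped)
            return "json"
        except json.JSONDecodeError:
            pass
    try:
        tree = ast.parse(blob)
        if any(isinstance(node, _CODE_NODES) for node in tree.body):
            return "code"
    except (SyntaxError, ValueError):
        pass
    return "prose"


def compress_json(blob: str) -> str:
    """Lossless: re-serialize without insignificant whitespace."""
    return json.dumps(json.loads(blob), separators=(",", ":"))


def compress_code(blob: str) -> str:
    """Lossy: keep top-level signatures, drop function and class bodies."""
    lines = []
    for node in ast.parse(blob).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            prefix = "async def" if isinstance(node, ast.AsyncFunctionDef) else "def"
            lines.append(f"{prefix} {node.name}({ast.unparse(node.args)}): ...")
        elif isinstance(node, ast.ClassDef):
            lines.append(f"class {node.name}: ...")
        else:
            lines.append(ast.unparse(node))
    return "\n".join(lines)


def compress_prose(blob: str) -> str:
    """Collapse runs of whitespace; a real prose compressor would do far more."""
    return " ".join(blob.split())


COMPRESSORS: dict[str, Callable[[str], str]] = {
    "json": compress_json,
    "code": compress_code,
    "prose": compress_prose,
}


def route(blob: str) -> tuple[str, str]:
    """Detect the content type, then apply the compressor built for it."""
    kind = detect_type(blob)
    return kind, COMPRESSORS[kind](blob)
```

The point of the dispatch table is that each compressor can make assumptions the others cannot: the JSON path never has to guess at structure, and the code path can drop bodies while keeping signatures intact.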

The part that makes it usable: reversibility

Lossy compression on data that has to be exact is dangerous. Headroom's answer is to not make it permanent.

CCR — Compress-Cache-Retrieve: the original, uncompressed content is stored locally and never deleted. The LLM is handed a `headroom_retrieve` tool, so if the compressed view dropped a detail it actually needs, it pulls the full original on demand. Compression becomes a default view, not a one-way door.
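To make the Compress-Cache-Retrieve pattern concrete, here is a minimal sketch under stated assumptions. It is not Headroom's code: `OriginalStore`, `compress_with_handle`, `extract_handle`, and the `original_id=` marker are hypothetical names, the store is an in-memory dict standing in for a local cache, and the "compressor" is a trivial truncation. The real tool exposes retrieval through `headroom_retrieve`, whose exact signature is not assumed here.

```python
import hashlib
from typing import Callable


class OriginalStore:
    """In-memory stand-in for a local cache that keeps every original."""

    def __init__(self) -> None:
        self._originals: dict[str, str] = {}

    def put(self, original: str) -> str:
        """Store the original and return a short content-derived handle."""
        key = hashlib.sha256(original.encode("utf-8")).hexdigest()[:16]
        self._originals[key] = original
        return key

    def get(self, key: str) -> str:
        """Return the full, uncompressed original for a handle."""
        return self._originals[key]


def compress_with_handle(original: str,
                         compressor: Callable[[str], str],
                         store: OriginalStore) -> str:
    """Cache the original first, then return the compressed view plus its handle."""
    key = store.put(original)
    return f"[compressed; original_id={key}]\n{compressor(original)}"


def extract_handle(compressed_view: str) -> str:
    """Pull the handle back out of the marker line (hypothetical format)."""
    return compressed_view.split("original_id=", 1)[1].split("]", 1)[0]


store = OriginalStore()
original = "2024-01-01 ERROR timeout contacting upstream\n" * 100
view = compress_with_handle(original, lambda text: text[:45], store)

# What a retrieve tool call would do when the model needs the dropped detail.
recovered = store.get(extract_handle(view))
print(len(view), len(original), recovered == original)
```

The ordering matters: the original is cached before the compressed view is produced, so there is never a compressed message in the context whose source cannot be recovered.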

It ships four ways so it can drop into an existing stack without a rewrite: a library (`compress(messages)` in Python or TypeScript), a zero-code proxy, a `headroom wrap claude|codex|cursor` agent wrapper, and an MCP server exposing `headroom_compress` / `headroom_retrieve` / `headroom_stats`.

What they report

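| Workload | Before (tokens) | After (tokens) | Saved |
| --- | --- | --- | --- |
| Code search (100 results) | 17,765 | 1,408 | 92% |
| SRE incident debugging | 65,694 | 5,118 | 92% |
| GitHub issue triage | 54,174 | 14,761 | 73% |
| Codebase exploration | 78,502 | 41,254 | 47% |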
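The Saved column follows directly from the Before and After counts as (before - after) / before. A short check, using only the reported numbers above; `percent_saved` is a hypothetical helper name, and the same function works for measuring your own workloads.

```python
# (before, after) token counts exactly as reported in the table above.
REPORTED: dict[str, tuple[int, int]] = {
    "Code search (100 results)": (17_765, 1_408),
    "SRE incident debugging": (65_694, 5_118),
    "GitHub issue triage": (54_174, 14_761),
    "Codebase exploration": (78_502, 41_254),
}


def percent_saved(before: int, after: int) -> int:
    """Share of tokens removed, rounded to the nearest whole percent."""
    return round(100 * (before - after) / before)


for workload, (before, after) in REPORTED.items():
    print(f"{workload}: {percent_saved(before, after)}% saved")
```

The spread is the useful part: savings depend heavily on the workload, so a single headline percentage says little about what a given agent will see.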

On accuracy, their published evals report GSM8K unchanged (±0.000), TruthfulQA slightly up (+0.030), and SQuAD/BFCL holding ~97% at 19–32% compression. As with any vendor's own benchmarks: this is a reason to test on your data, not a guarantee on it.

Where it breaks — lossy compression on precise data

The takeaway isn't 'install Headroom.' It's that context is an engineered resource, not a dumping ground. Whether you use a tool or roll your own pruning, the discipline is identical: route by content type, never lossily compress data that must be exact, and always keep a path back to the original.
