
Context Compression: Why Dumping Raw Tool Outputs Into the LLM Burns Tokens

RAG chunks, tool outputs, and logs piped raw into the context window cost money linearly, bury the signal in the middle, and starve the window. Headroom (open-source, Apache-2.0) compresses agent context by content type: JSON, code (AST-aware), and prose are each routed to their own compressor. It claims the same answers at 60–95% fewer tokens, with reversible CCR (Compress-Cache-Retrieve) so the model can fetch originals on demand. This is an honest teardown of a third-party tool, including where lossy compression on precise data breaks.

Your agent calls a tool, gets back 65,000 tokens of JSON, and pipes all of it into the next LLM call to answer a question that needed 5,000. Multiply that by every tool call, every retrieved chunk, every log file in the loop — and you are paying, in money and in latency, to hand the model a haystack and ask it to find a needle it already described to you.

This is the context-engineering problem almost nobody budgets for. RAG chunks, tool outputs, stack traces, file contents: teams pipe the raw firehose straight into the window. That inflates cost linearly, pushes the real signal toward the middle where models attend to it worst, and eats the space that conversation history needs.

Headroom is an open-source (Apache-2.0) context-compression layer for AI agents from Tejas Chopra. It sits between your agent and the model and compresses everything the LLM reads (tool outputs, logs, RAG chunks, files, history) before it arrives, claiming 60–95% fewer tokens with the same answers. It is a third-party tool, not an Anthropic product. The engineering idea behind it is the part worth understanding thoroughly.

Why raw context burns tokens

Cost is linear in tokens. If you needed only 5k tokens of a 65k-token tool result, about 92% of the input you paid for on that call was waste (you sent 13x what the question required), and you make that call again and again.
'Lost in the middle': as context grows, models attend best to the start and end and worst to the middle. Dumping raw bulk doesn't just cost more, it can lower answer quality by burying the signal where the model reads least carefully.
Window contention: tokens spent on an un-pruned log are tokens unavailable for conversation history, retrieved evidence, or the agent's own scratchpad.
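
To put numbers on the first point, here is a minimal sketch of the per-call arithmetic. The token counts are the 65k and 5k figures used above; the price of $3 per million input tokens and the 1,000 calls per day are made-up assumptions for illustration, not any provider's actual rate.

```python
# Hypothetical figures: neither the price nor the call volume comes from
# Headroom or from any provider's price list.
PRICE_PER_MILLION_INPUT_TOKENS = 3.00
CALLS_PER_DAY = 1_000


def input_cost(tokens: int, price_per_million: float = PRICE_PER_MILLION_INPUT_TOKENS) -> float:
    """Input cost of one call; linear in the number of tokens sent."""
    return tokens * price_per_million / 1_000_000


def wasted_fraction(sent: int, needed: int) -> float:
    """Share of the tokens sent that the question did not need."""
    return (sent - needed) / sent


sent, needed = 65_000, 5_000
daily_raw = input_cost(sent) * CALLS_PER_DAY
daily_pruned = input_cost(needed) * CALLS_PER_DAY

print(f"wasted share of input tokens: {wasted_fraction(sent, needed):.0%}")
print(f"daily input cost, raw vs pruned: ${daily_raw:.2f} vs ${daily_pruned:.2f}")
```

Because cost is linear, the wasted share stays the same whatever price you plug in; only the absolute bill changes.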

How Headroom compresses — type-aware, not blind

The core design decision is that you do not compress everything the same way. A router detects the content type and sends each blob to a compressor built for it.

ContentRouter — detects whether a blob is JSON, code, or prose and routes accordingly. This is the part that keeps compression honest: you never prose-summarise a JSON payload.
SmartCrusher — structural compression for JSON: arrays of dicts, nested objects, mixed types.
CodeCompressor — AST-aware compression for Python, JS, Go, Rust, Java, C++. It understands code structure instead of treating source as a bag of text.
Kompress-base — a HuggingFace model trained on agentic traces, for natural-language prose.
CacheAligner — stabilises the prompt prefix so the provider's KV cache actually hits. Compression can otherwise sabotage caching by changing the prefix on every call.
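
The routing idea can be sketched in a few lines. This is not Headroom's implementation: `detect_type`, `crush_json`, and `route` are hypothetical names, code detection here only recognises Python, and the JSON step is a simple lossless rewrite of an array of same-shaped dicts into columns and rows, which is one plausible form of structural compression.

```python
import ast
import json

# Hypothetical helpers for illustration only; not Headroom's API.
_CODE_NODES = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef,
               ast.Import, ast.ImportFrom, ast.Assign)


def detect_type(blob: str) -> str:
    """Classify a blob as 'json', 'code' (Python only in this sketch), or 'prose'."""
    text = blob.strip()
    if text[:1] in ("{", "["):
        try:
            json.loads(text)
            return "json"
        except json.JSONDecodeError:
            pass
    try:
        tree = ast.parse(text)
    except (SyntaxError, ValueError):
        return "prose"
    # Parsing alone is too weak (a single word is valid Python), so require
    # at least one top-level statement that looks like real code.
    if any(isinstance(node, _CODE_NODES) for node in tree.body):
        return "code"
    return "prose"


def crush_json(text: str) -> str:
    """Lossless structural rewrite: list of same-shaped dicts -> columns + rows."""
    data = json.loads(text)
    if isinstance(data, list) and data and all(isinstance(r, dict) for r in data):
        keys = list(data[0].keys())
        if all(list(r.keys()) == keys for r in data):
            data = {"columns": keys, "rows": [[r[k] for k in keys] for r in data]}
    return json.dumps(data, separators=(",", ":"))


def route(blob: str) -> tuple:
    """Send each blob to a handler for its type; code and prose pass through here."""
    handlers = {"json": crush_json, "code": lambda s: s, "prose": lambda s: s}
    kind = detect_type(blob)
    return kind, handlers[kind](blob)


records = json.dumps([{"id": i, "status": "ok", "region": "eu"} for i in range(50)])
kind, compact = route(records)
print(kind, len(records), "->", len(compact))
```

The saving in the JSON path comes from writing each key once instead of once per record, so it only pays off on long arrays. The point of the router is the guarantee it gives: a JSON payload never reaches a prose summariser.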

The part that makes it usable: reversibility

Lossy compression on data that has to be exact is dangerous. Headroom's answer is to not make it permanent.

CCR — Compress-Cache-Retrieve: the original, uncompressed content is stored locally and never deleted. The LLM is handed a `headroom_retrieve` tool, so if the compressed view dropped a detail it actually needs, it pulls the full original on demand. Compression becomes a default view, not a one-way door.
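
A minimal sketch of the compress-cache-retrieve pattern, under stated assumptions: the in-memory dict stands in for the local store, `OriginalStore`, `compressed_view`, and `head_only` are hypothetical names, and the truncating compressor is deliberately crude. The real `headroom_retrieve` tool's signature is not shown in the source, so nothing here should be read as its API.

```python
import hashlib
from typing import Callable, Dict, Tuple


class OriginalStore:
    """In-memory stand-in for a local store of uncompressed originals."""

    def __init__(self) -> None:
        self._originals: Dict[str, str] = {}

    def put(self, content: str) -> str:
        # Content-addressed key: the same original always maps to the same id.
        key = hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]
        self._originals[key] = content
        return key

    def retrieve(self, key: str) -> str:
        # What a retrieve tool would hand back to the model: the exact original.
        return self._originals[key]


def compressed_view(content: str, store: OriginalStore,
                    compress: Callable[[str], str]) -> Tuple[str, str]:
    """Cache the original first, then return a lossy view tagged with its id."""
    key = store.put(content)
    view = f"{compress(content)}\n[compressed view; original_id={key}]"
    return view, key


def head_only(text: str, limit: int = 60) -> str:
    """Deliberately lossy placeholder compressor: keep only the first characters."""
    return text[:limit] + "..."


store = OriginalStore()
log = "\n".join(f"line {i}: heartbeat ok" for i in range(50))
view, key = compressed_view(log, store, head_only)

# The model sees `view`; if it needs a dropped detail it asks for `key`.
assert store.retrieve(key) == log
```

The ordering is the design choice that matters: the original is stored before any lossy step runs, so a bad compression can always be undone.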

It ships four ways so it can drop into an existing stack without a rewrite: a library (`compress(messages)` in Python or TypeScript), a zero-code proxy, a `headroom wrap claude|codex|cursor` agent wrapper, and an MCP server exposing `headroom_compress` / `headroom_retrieve` / `headroom_stats`.

What they report

Workload	Tokens before	Tokens after	Tokens saved
Code search (100 results)	17,765	1,408	92%
SRE incident debugging	65,694	5,118	92%
GitHub issue triage	54,174	14,761	73%
Codebase exploration	78,502	41,254	47%
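
The "saved" column can be recomputed from the before and after counts as a sanity check. The sketch below uses only the four rows reported above; the combined figure at the end is my own aggregate over those rows, not a number Headroom publishes.

```python
# Token counts copied from the reported table: (before, after).
reported = {
    "Code search (100 results)": (17_765, 1_408),
    "SRE incident debugging": (65_694, 5_118),
    "GitHub issue triage": (54_174, 14_761),
    "Codebase exploration": (78_502, 41_254),
}


def saved_fraction(before: int, after: int) -> float:
    """Fraction of tokens removed by compression."""
    return 1 - after / before


for workload, (before, after) in reported.items():
    print(f"{workload}: {saved_fraction(before, after):.0%}")

total_before = sum(before for before, _ in reported.values())
total_after = sum(after for _, after in reported.values())
print(f"All four combined: {saved_fraction(total_before, total_after):.0%}")
```

The per-row results match the table (92%, 92%, 73%, 47%), and the four workloads together come to about 71%. The spread matters more than the headline: the gain depends heavily on how repetitive the payload is.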

On accuracy, their published evals report GSM8K unchanged (±0.000), TruthfulQA slightly up (+0.030), and SQuAD/BFCL holding ~97% at 19–32% compression. As with any vendor's own benchmarks: this is a reason to test on your data, not a guarantee on it.

Where it breaks — lossy compression on precise data

Exact-value data is the trap. A transaction ID, a legal clause, a float that must round-trip, a single config flag — if prose-style compression paraphrases or drops it, the agent answers fluently and wrong, and nothing throws. Same failure shape as a silent hallucination.
Three defenses, in order: type-aware routing (never prose-compress a JSON record), reversibility via CCR (the model can pull the exact original back), and judgment (some fields you simply never compress).
Compression isn't free. There's a model in the path; it adds its own latency and compute. On a short context it can cost more than it saves — the win is on the bloated tool-output and RAG-chunk calls, not every call.
It's young and third-party. Put it behind your own eval set before trusting it on production-critical, exact-answer workloads.
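
One way to make the "judgment" defense mechanical is a guard that refuses a compressed view if any exact-looking value went missing. This is a sketch under assumptions, not something Headroom is documented to do: `exact_tokens` and `guarded_view` are hypothetical names, and the rule "any token containing a digit must survive verbatim" is a deliberately conservative heuristic.

```python
import re
from typing import Set

# Heuristic (an assumption of this sketch): any token containing a digit,
# such as an id, an amount, or a version, is treated as must-be-exact.
_EXACT = re.compile(r"[\w.\-]*\d[\w.\-]*")


def exact_tokens(text: str) -> Set[str]:
    """Collect digit-bearing tokens, dropping trailing sentence punctuation."""
    return {token.rstrip(".-") for token in _EXACT.findall(text)}


def guarded_view(original: str, compressed: str) -> str:
    """Return the compressed view only if every exact token survived verbatim."""
    missing = exact_tokens(original) - exact_tokens(compressed)
    return original if missing else compressed


original = "Refund txn_8f3a91 for 1042.50 USD failed after 3 retries."
faithful = "Refund txn_8f3a91 (1042.50 USD) failed, 3 retries."
paraphrased = "A refund of about 1000 USD failed."

assert guarded_view(original, faithful) == faithful      # all values kept
assert guarded_view(original, paraphrased) == original   # fall back to the raw text
```

The guard trades some savings for safety: it rejects views that merely round or rephrase a number, which is exactly the failure that otherwise passes silently.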

The takeaway isn't 'install Headroom.' It's that context is an engineered resource, not a dumping ground. Whether you use a tool or roll your own pruning, the discipline is identical: route by content type, never lossily compress data that must be exact, and always keep a path back to the original.

