
Cost Explosion: How a Single Prompt Change Multiplied Our API Bill by 40×

A production post-mortem: how an innocent prompt refactor removed a compression step, what the token math looked like in real time, and the guardrails that should have caught it.

On a Tuesday afternoon, our Slack alert fired: API spend was $340 in the last hour. Our daily budget was $200. By the time the on-call engineer pulled up the dashboard, it was $680. The culprit was a prompt refactor that had shipped that morning — a seemingly innocuous change that removed a compression step from our context assembly pipeline.

The refactor was reviewed. It was well-intentioned. It made the code cleaner. And it increased average token count per request from 1,200 to 47,000.

The math that got away from us

Our application retrieved relevant documents from a vector store and assembled them into a context window before calling the LLM. The old pipeline had a compression step: after retrieval, a small summarization call condensed each retrieved chunk to ~100 tokens. The refactor removed this step because it added latency.

What we didn't account for: the average retrieved chunk was 3,900 tokens, and we retrieved 12 chunks per request. Old pipeline: 12 × 100 = 1,200 tokens of context. New pipeline: 12 × 3,900 = 46,800 tokens of context. Because input cost scales linearly with token count, at GPT-4's pricing this was a 39× increase in input cost per request.
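The ratio is easy to reproduce. The sketch below takes the per-request average from before the refactor (1,200 tokens) and the uncompressed context size (12 chunks of 3,900 tokens). The per-token price is an assumed placeholder, not a figure from the incident; it cancels out of the ratio, which is the point.

```python
# Illustrative arithmetic only. The price is a hypothetical placeholder,
# not a number from the incident; it cancels out of the ratio.
ASSUMED_PRICE_PER_1K_INPUT_TOKENS = 0.01  # USD, assumption


def input_cost(tokens: int, price_per_1k: float = ASSUMED_PRICE_PER_1K_INPUT_TOKENS) -> float:
    """Input-side cost of a single request in USD."""
    return tokens / 1000 * price_per_1k


chunks_per_request = 12
avg_chunk_tokens = 3_900

tokens_before = 1_200                                  # average per request before the refactor
tokens_after = chunks_per_request * avg_chunk_tokens   # uncompressed retrieved context

ratio = input_cost(tokens_after) / input_cost(tokens_before)
print(f"{tokens_before} -> {tokens_after} tokens, {ratio:.1f}x input cost")
```

Changing the assumed price changes the dollar amounts but not the multiplier, so the regression is visible from token counts alone.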

Token count is not a metric most engineers monitor in staging. It should be. Staging checks tell you whether responses look right, not what they cost: a 10% change in average response quality is hard to see there, and a 40× increase in token count is just as invisible until the invoice arrives.

What cost explosion looks like in practice

Cost explosions don't always come from a single obvious change. Common triggers:

Removing a compression or truncation step (as above)
Switching from chunk-level to document-level retrieval without adjusting the number of retrieved items
Adding few-shot examples to a system prompt that's called on every request
A loop bug that triggers the same LLM call N times instead of once
A model upgrade where input token pricing is significantly higher (GPT-4 vs GPT-3.5 was roughly a 20× difference in input token price)
A user input that's much longer than your p99 test cases (pasting an entire document into a chat field)

The guardrails that should have been in place

1. Token count logging and alerting

Every LLM API call should log input_tokens, output_tokens, and estimated_cost as structured fields. Set an alert when p95 input token count exceeds 2× its 7-day average. This catches cost regressions before they become incidents.
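A minimal sketch of what this could look like, using only the standard library. The logger name, the function names, and the reading of "7-day average" as the mean of seven daily p95 values are all assumptions for illustration; in practice the percentile and the alert rule would more likely live in your metrics system.

```python
import json
import logging
import statistics
from typing import List

logger = logging.getLogger("llm_usage")  # hypothetical logger name


def log_llm_call(input_tokens: int, output_tokens: int,
                 input_price_per_1k: float, output_price_per_1k: float) -> dict:
    """Emit one structured record per LLM call and return it."""
    record = {
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "estimated_cost": round(
            input_tokens / 1000 * input_price_per_1k
            + output_tokens / 1000 * output_price_per_1k,
            6,
        ),
    }
    logger.info(json.dumps(record))
    return record


def p95(values: List[int]) -> int:
    """Nearest-rank 95th percentile (integer math avoids float rounding)."""
    if not values:
        raise ValueError("p95 of an empty sample is undefined")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def should_alert(recent_input_tokens: List[int],
                 daily_p95_last_7_days: List[int],
                 factor: float = 2.0) -> bool:
    """True when the current p95 exceeds `factor` times the 7-day average p95."""
    baseline = statistics.mean(daily_p95_last_7_days)
    return p95(recent_input_tokens) > factor * baseline
```

With a week of daily p95 values around 1,300 tokens, a window of requests near 47,000 tokens trips the alert immediately, while normal traffic does not.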

2. Hard token caps per request

Implement a hard limit in your context assembly code: if the assembled context exceeds X tokens, truncate or compress before the API call. The limit should be a named constant, reviewed in PRs that touch the context assembly pipeline.
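One way this could look is sketched below. The cap value and the whitespace-based token counter are placeholders, not our production settings; swap in your model's real tokenizer. The sketch also assumes chunks arrive ordered by relevance, so dropping from the tail loses the least useful context first.

```python
from typing import Callable, List

MAX_CONTEXT_TOKENS = 4_000  # hypothetical value; review any PR that changes it


def approx_token_count(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def cap_context(chunks: List[str],
                max_tokens: int = MAX_CONTEXT_TOKENS,
                count_tokens: Callable[[str], int] = approx_token_count) -> List[str]:
    """Keep chunks in order until adding the next one would exceed the cap."""
    kept: List[str] = []
    used = 0
    for chunk in chunks:
        n = count_tokens(chunk)
        if used + n > max_tokens:
            break  # drop this chunk and everything after it
        kept.append(chunk)
        used += n
    return kept
```

Truncating at chunk boundaries keeps each retained chunk intact. Compressing the overflow instead of dropping it is the other option named above, at the price of an extra model call.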

3. Spend rate alerting, not just daily totals

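Daily budget alerts fire too late. Alert on hourly spend rate, extrapolated from a short rolling window: if current_hourly_rate > daily_budget / 8 (a pace that would burn the full daily budget in eight hours), page on-call. This catches explosions within minutes, not hours.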
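The rule fits in a few lines. This is a sketch under assumptions: the function names and the idea of extrapolating from a short rolling window are illustrative, and the spend figures would come from whatever billing or usage data you already collect.

```python
def hourly_rate_threshold(daily_budget: float, divisor: float = 8.0) -> float:
    """Hourly spend rate above which on-call should be paged."""
    return daily_budget / divisor


def should_page(spend_in_window: float, window_minutes: float, daily_budget: float) -> bool:
    """Extrapolate spend in a rolling window to an hourly rate and compare."""
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    current_hourly_rate = spend_in_window * 60.0 / window_minutes
    return current_hourly_rate > hourly_rate_threshold(daily_budget)


# With the $200 daily budget from this incident, the threshold is $25/hour.
print(hourly_rate_threshold(200))   # 25.0
print(should_page(340, 60, 200))    # True: $340 in the last hour
```

A shorter window reacts faster but is noisier, so it is worth requiring the condition to hold for a couple of consecutive windows before paging.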

4. Staging cost estimation

Add a CI step that runs your LLM pipeline against a sample of representative inputs and reports the estimated cost per request. Gate the PR on this number not increasing by more than 20% vs. the main branch. Tools like LangSmith, Helicone, and Portkey expose cost tracking APIs that make this straightforward.
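The gate itself can be very small. The sketch below assumes you have already produced a list of estimated per-request costs for the main branch and for the PR branch over the same sample inputs; how those estimates are collected, and all function names here, are hypothetical.

```python
from statistics import mean
from typing import List, Tuple

MAX_RELATIVE_INCREASE = 0.20  # the 20% threshold described above


def cost_gate(baseline_costs: List[float],
              candidate_costs: List[float],
              max_increase: float = MAX_RELATIVE_INCREASE) -> Tuple[bool, float]:
    """Return (passed, relative_increase) for mean estimated cost per request."""
    baseline = mean(baseline_costs)
    candidate = mean(candidate_costs)
    if baseline <= 0:
        raise ValueError("baseline cost must be positive")
    increase = (candidate - baseline) / baseline
    return increase <= max_increase, increase


def main(baseline_costs: List[float], candidate_costs: List[float]) -> int:
    """Print a one-line report and return a process exit code for CI."""
    passed, increase = cost_gate(baseline_costs, candidate_costs)
    status = "OK" if passed else "FAIL"
    print(f"{status}: estimated cost per request changed by {increase:+.1%}")
    return 0 if passed else 1
```

In a CI job this would be wired up with something like `sys.exit(main(baseline, candidate))`, so a regression like ours fails the build instead of reaching production.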

The cheapest production safeguard: a per-request token cap in context assembly code. It's five lines. Add it before you ship to production, not after your first incident.
