GenAI Systems Lab
AI Engineering 9 min read

Cost-Aware Prompt Engineering

How to engineer prompts that perform better and cost less. Token budgets by component, the five optimisation levers (output format, few-shot, CoT, prompt caching, model routing), and the iterative optimisation loop.

There's a version of prompt engineering that's just vibes — add more words, hope the model cooperates. Then there's the version that treats your prompt like a scarce resource with a real dollar cost, and engineers it accordingly. That second version is the one that survives contact with production.

This post is about the intersection: how to write prompts that perform better *and* cost less. It turns out those goals align more than you'd think.

Your prompt runs a million times

In development, the cost of a single prompt is negligible. That creates a dangerous habit: adding context, examples, and instructions without any discipline. A system prompt that sprawls to 4,000 tokens costs $0.012 per request at an input price of $3 per million tokens, roughly GPT-4o-class pricing. At 500K daily requests, that's $6,000/day, or $180K/month, just for the system prompt. Discipline compounds.

The single highest-leverage optimisation in production LLM systems is almost always: trim the system prompt. Not the model. Not the infrastructure. The prompt.
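The system prompt arithmetic in this section is worth keeping as a small helper so it can be rerun whenever prices or traffic change. This is a minimal sketch: the function name is illustrative, the $3 per million input tokens is the rate implied by the $0.012 figure rather than a quoted price, and a 30-day month is assumed.

```python
def system_prompt_cost(prompt_tokens: int, requests_per_day: int,
                       usd_per_million_input: float, days: int = 30) -> dict:
    """Cost attributable to the system prompt alone (hypothetical helper)."""
    per_request = prompt_tokens * usd_per_million_input / 1_000_000
    per_day = per_request * requests_per_day
    return {"per_request": per_request, "per_day": per_day, "per_month": per_day * days}


# Assumed price: $3 per million input tokens. Substitute your provider's rate.
before = system_prompt_cost(4_000, 500_000, usd_per_million_input=3.0)
after = system_prompt_cost(1_500, 500_000, usd_per_million_input=3.0)

print(f"4,000-token prompt: ${before['per_day']:,.0f}/day, ${before['per_month']:,.0f}/month")
print(f"1,500-token prompt: ${after['per_day']:,.0f}/day, ${after['per_month']:,.0f}/month")
```

Trimming the prompt from 4,000 to 1,500 tokens cuts this line item by the same 62.5%, which is why the trim tends to beat infrastructure work on return for effort.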

Token budgets by component

| Component | Typical range | Notes |
|---|---|---|
| System prompt | 200–2,000 tokens | Paid on every request. Cache it if static and >1024 tokens. |
| User message | 10–500 tokens | You can't control this, but you can set max_length on inputs. |
| RAG context | 500–8,000 tokens | The biggest variable. Retrieval precision directly reduces this. |
| Chat history | 0–50,000 tokens | Grows unboundedly without compaction. The silent cost killer. |
| Output | 100–2,000 tokens | Priced 3–5× higher than input. Set max_tokens. Use streaming. |
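To see which component dominates a given request, the budget can be turned into a per-component cost breakdown. The sketch below is illustrative only: the class and function names are made up for this example, the token counts are sample values within the ranges above, and the prices ($3 input, $15 output per million tokens) are assumptions chosen to reflect a 5× output premium.

```python
from dataclasses import dataclass, asdict


@dataclass
class RequestTokens:
    system_prompt: int
    user_message: int
    rag_context: int
    chat_history: int
    output: int


INPUT_COMPONENTS = ("system_prompt", "user_message", "rag_context", "chat_history")


def cost_breakdown(t: RequestTokens, usd_per_m_input: float, usd_per_m_output: float) -> dict:
    """Dollar cost per component; input and output are priced separately."""
    counts = asdict(t)
    costs = {name: counts[name] * usd_per_m_input / 1_000_000 for name in INPUT_COMPONENTS}
    costs["output"] = counts["output"] * usd_per_m_output / 1_000_000
    return costs


# Sample request; all numbers are assumptions for illustration.
request = RequestTokens(system_prompt=1_000, user_message=100,
                        rag_context=4_000, chat_history=6_000, output=500)
costs = cost_breakdown(request, usd_per_m_input=3.0, usd_per_m_output=15.0)
total = sum(costs.values())

# Print components from most to least expensive.
for name, usd in sorted(costs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:14s} ${usd:.4f}  {usd / total:5.1%}")
```

In this sample the chat history, not the system prompt, is the largest line item, which is the pattern the "silent cost killer" note warns about.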

The five prompt engineering levers

1. Be specific about output format

Vague instructions produce verbose outputs. "Summarise this" might yield 800 tokens. "Summarise in 3 bullet points, max 20 words each" caps the output at 60 words, roughly 80 tokens. You get a better result *and* spend about 90% less on output tokens. Specificity is not just a quality lever; it's a cost lever.
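A format constraint pairs naturally with a matching max_tokens cap, so the model cannot overrun even if it ignores the instruction. The helper below is a hypothetical sketch: the 1.4 tokens-per-word ratio is a rough rule of thumb for English, and the 20% headroom is an arbitrary safety margin, not a measured value.

```python
import math


def bullet_summary_spec(bullets: int = 3, max_words: int = 20,
                        tokens_per_word: float = 1.4,
                        headroom: float = 1.2) -> tuple[str, int]:
    """Return a format-constrained instruction and a max_tokens cap to go with it."""
    instruction = (
        f"Summarise in {bullets} bullet points, max {max_words} words each. "
        "Return only the bullets."
    )
    # Word budget -> estimated tokens -> padded cap (all ratios are assumptions).
    max_tokens = math.ceil(bullets * max_words * tokens_per_word * headroom)
    return instruction, max_tokens


instruction, max_tokens = bullet_summary_spec()
print(instruction)
print("max_tokens cap:", max_tokens)
```

The instruction does the real work; the cap is a backstop that bounds worst-case output spend.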

2. Few-shot examples: the quality/cost tradeoff

Few-shot examples dramatically improve quality on nuanced tasks, but you pay for them on every request. Three examples at 300 tokens each add 900 tokens per request. Evaluate: can you get the same quality with one example? With a better zero-shot instruction? With fine-tuning? At high volume, fine-tuning on your few-shot examples pays off faster than you'd think.
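The fine-tuning question comes down to a break-even calculation. The sketch below uses assumed inputs throughout: 500K requests/day, $3 per million input tokens, and a hypothetical one-off fine-tuning cost of $2,000. It also ignores any difference in per-token price for a fine-tuned model, which you would need to add for a real decision.

```python
def few_shot_daily_cost(tokens_per_example: int, n_examples: int,
                        requests_per_day: int, usd_per_m_input: float) -> float:
    """Daily spend on the few-shot examples alone."""
    tokens_per_day = tokens_per_example * n_examples * requests_per_day
    return tokens_per_day * usd_per_m_input / 1_000_000


def break_even_days(one_off_cost_usd: float, daily_saving_usd: float) -> float:
    """Days until a one-off cost is repaid by a daily saving."""
    if daily_saving_usd <= 0:
        return float("inf")
    return one_off_cost_usd / daily_saving_usd


# Assumed volume and price; replace with your own figures.
three_shot = few_shot_daily_cost(300, 3, 500_000, usd_per_m_input=3.0)
one_shot = few_shot_daily_cost(300, 1, 500_000, usd_per_m_input=3.0)

print(f"Three examples: ${three_shot:,.0f}/day")
print(f"One example:    ${one_shot:,.0f}/day")
# Hypothetical $2,000 fine-tuning run that removes all three examples.
print(f"Fine-tuning breaks even after {break_even_days(2_000, three_shot):.1f} days")
```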

3. Chain-of-thought: spend tokens to save retries

CoT prompting — asking the model to reason step-by-step — increases output tokens by 2–5×. But it can reduce error rates by 30–60% on reasoning tasks. If getting it wrong means a human escalation or a retry, CoT often saves net tokens. Use CoT on tasks where wrong answers are expensive. Skip it on tasks where speed matters more than reasoning depth.
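Whether CoT pays for itself depends on what a wrong answer costs. A minimal sketch of that trade-off, assuming a simple model where each request costs its tokens plus the error rate times a fixed failure cost (a retry or a human escalation). All numbers are illustrative: a 3× output increase and a halved error rate sit within the ranges quoted above, and the prices are assumptions.

```python
def expected_cost(input_tokens: int, output_tokens: int, error_rate: float,
                  failure_cost_usd: float,
                  usd_per_m_input: float = 3.0, usd_per_m_output: float = 15.0) -> float:
    """Token cost of one attempt plus the expected cost of getting it wrong."""
    attempt = (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000
    return attempt + error_rate * failure_cost_usd


def compare(failure_cost_usd: float) -> tuple[float, float]:
    """Expected cost (plain, CoT) for an assumed task profile."""
    plain = expected_cost(1_000, 200, error_rate=0.10, failure_cost_usd=failure_cost_usd)
    cot = expected_cost(1_000, 600, error_rate=0.05, failure_cost_usd=failure_cost_usd)
    return plain, cot


# Expensive failure (e.g. human escalation) vs cheap failure (e.g. automatic retry).
for label, failure_cost in [("escalation, $0.50", 0.50), ("retry, $0.05", 0.05)]:
    plain, cot = compare(failure_cost)
    winner = "CoT" if cot < plain else "plain"
    print(f"{label}: plain=${plain:.4f} cot=${cot:.4f} -> {winner}")
```

Under these assumptions the crossover sits at a failure cost of about $0.12: above it CoT is cheaper overall, below it the extra output tokens are not worth it.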

4. Prompt caching

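If your system prompt plus any static RAG context is over 1,024 tokens and identical across requests, use prompt caching. Anthropic bills cache reads at a 90% discount on input tokens (cache writes carry a 25% premium); OpenAI applies its discount, 50% in the figure used here, automatically to repeated prefixes. The saving scales with your input price. A 3,000-token cached prefix at 1M requests/day is 3 billion prefix tokens a day, so at $3 per million input tokens that prefix costs $9,000/day uncached, and caching saves roughly $4,500/day at a 50% discount or $8,100/day at 90%. It takes 20 minutes to implement.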

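```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# your_long_system_prompt and user_message are defined elsewhere in your app
response = client.messages.create(
    model="claude-opus-4-6",
    system=[{
        "type": "text",
        "text": your_long_system_prompt,  # must be identical across requests
        "cache_control": {"type": "ephemeral"}  # mark for caching
    }],
    messages=[{"role": "user", "content": user_message}],
    max_tokens=1024
)
# A cache hit shows up as a non-zero count here
print(response.usage.cache_read_input_tokens)
```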
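To estimate what caching is worth before enabling it, the saving can be computed from prefix size, volume, price, and discount. This is a rough sketch under stated assumptions: the $3 per million input price is illustrative, the hit rate is a parameter you would measure, and cache-write premiums are ignored, so treat the result as an upper bound.

```python
def daily_caching_saving(prefix_tokens: int, requests_per_day: int,
                         usd_per_m_input: float, read_discount: float,
                         hit_rate: float = 1.0) -> float:
    """Upper-bound daily saving from caching a static prefix.

    read_discount is the fraction taken off cached input tokens (0.9 = 90% off).
    Cache-write premiums and cache expiry are not modelled.
    """
    uncached = prefix_tokens * requests_per_day * usd_per_m_input / 1_000_000
    return uncached * read_discount * hit_rate


# 3,000-token prefix at 1M requests/day, assumed $3 per million input tokens.
saving_90 = daily_caching_saving(3_000, 1_000_000, 3.0, read_discount=0.9)
saving_50 = daily_caching_saving(3_000, 1_000_000, 3.0, read_discount=0.5)

print(f"90% discount: ${saving_90:,.0f}/day")
print(f"50% discount: ${saving_50:,.0f}/day")
```

Passing a measured hit_rate below 1.0 gives a more realistic figure for bursty traffic, where cached prefixes expire between requests.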

5. Model routing by complexity

Not all requests need GPT-4o. A support ticket classifier, a yes/no safety check, a template fill: these are GPT-4o-mini or Claude Haiku tasks. On tasks like these the quality difference is usually negligible, while the cost difference is 10–30×. Build a lightweight complexity classifier that routes simple requests to cheaper models, and reserve the frontier model for tasks that genuinely need it.
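A router can start as a few lines of rules before it becomes a trained classifier. The sketch below is a stand-in heuristic, not a real complexity classifier: the task labels, the character threshold, and the function name are all hypothetical, and only the two model names come from the text above.

```python
CHEAP_MODEL = "gpt-4o-mini"
FRONTIER_MODEL = "gpt-4o"

# Hypothetical task labels that the calling code attaches to each request.
SIMPLE_TASKS = {"classify", "safety_check", "template_fill"}


def route(task_type: str, prompt: str, max_simple_chars: int = 2_000) -> str:
    """Pick a model name: cheap for short, simple tasks, frontier otherwise."""
    if task_type in SIMPLE_TASKS and len(prompt) <= max_simple_chars:
        return CHEAP_MODEL
    return FRONTIER_MODEL


print(route("classify", "Where is my refund?"))
print(route("multi_step_analysis", "Compare these three contracts and flag conflicts."))
print(route("classify", "x" * 5_000))  # simple task, but unusually long input
```

Defaulting to the frontier model on anything unrecognised keeps the failure mode conservative: a misroute costs money, not quality.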

The prompt optimisation loop

Good prompt engineering is empirical, not intuitive. The loop: write a prompt, run it against your eval set, measure quality score AND token count, iterate. You're optimising a two-objective function. Document every version in git. Never ship a prompt change without running evals first.
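The loop can be sketched as a small harness that scores each prompt variant on both objectives and picks the cheapest one that clears a quality bar. Everything here is a placeholder: fake_model stands in for a real model call, whitespace splitting stands in for a real tokenizer, exact-match stands in for your quality metric, and the two-item eval set is far too small for real use.

```python
from typing import Callable, Optional


def evaluate(prompt: str, eval_set: list,
             run_model: Callable[[str, str], str],
             count_tokens: Callable[[str], int]) -> dict:
    """Score one prompt variant: exact-match quality and average tokens per request."""
    correct = 0
    tokens = 0
    for user_input, expected in eval_set:
        output = run_model(prompt, user_input)
        correct += int(output.strip() == expected)
        tokens += count_tokens(prompt) + count_tokens(user_input) + count_tokens(output)
    return {"quality": correct / len(eval_set), "avg_tokens": tokens / len(eval_set)}


def pick_variant(results: dict, min_quality: float) -> Optional[str]:
    """Cheapest variant that meets the quality bar, or None if none does."""
    passing = {name: r for name, r in results.items() if r["quality"] >= min_quality}
    if not passing:
        return None
    return min(passing, key=lambda name: passing[name]["avg_tokens"])


def fake_model(prompt: str, user_input: str) -> str:
    # Placeholder for a real model call.
    return "positive" if "good" in user_input else "negative"


def rough_tokens(text: str) -> int:
    # Crude stand-in for a tokenizer: counts whitespace-separated words.
    return len(text.split())


eval_set = [("good product", "positive"), ("bad service", "negative")]
variants = {
    "v1": "You are a helpful assistant. Classify the sentiment of the following review as positive or negative.",
    "v2": "Classify sentiment: positive or negative.",
}
results = {name: evaluate(p, eval_set, fake_model, rough_tokens) for name, p in variants.items()}
best = pick_variant(results, min_quality=0.9)
print(results)
print("ship:", best)
```

Committing the variants dictionary and the results alongside the code gives you the version history the loop calls for.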

Use an automated prompt optimiser like DSPy or TextGrad for high-volume prompts. These tools iterate prompts against your eval set automatically, finding formulations that score better and often use fewer tokens.
