Prompt Engineering & Token Economics
How prompt structure affects quality, why few-shot beats zero-shot in most cases, and how to calculate real inference cost before you ship.
Prompt engineering is one of the most underrated skills in AI engineering, not because it's glamorous, but because a well-structured prompt can, in the best cases, halve your token count, double your accuracy, and save thousands of dollars a month in production.
How LLMs read your prompt
Your prompt is tokenised, embedded, and passed through every attention layer. The model has no special understanding of your intent — it's predicting the next token given everything before it. Structure matters because it shifts the probability distribution of what comes next.
The model sees your prompt as a sequence of tokens, not instructions. When you write "You are a helpful assistant", you are literally priming the distribution of likely next tokens — not flipping a switch labelled "helpful".
Zero-shot vs. few-shot vs. chain-of-thought
| Technique | When to use | Token cost | Accuracy |
|---|---|---|---|
| Zero-shot | Simple, well-defined tasks | Lowest | OK for easy tasks |
| Few-shot | Classification, formatting, tone | Medium | Strong improvement |
| Chain-of-thought | Reasoning, math, multi-step | High | Best for complex tasks |
| CoT + few-shot | Hard reasoning with examples | Highest | Gold standard |
Few-shot examples are among the most reliable prompt-level ways to constrain output format. If you need JSON output with a specific schema, showing 2–3 examples in the prompt usually works better than describing the schema in words.
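A minimal sketch of that idea, assuming a review-rating task: the example reviews, the `build_few_shot_prompt` and `parse_reply` helpers, and the `name`/`score`/`reason` keys are all hypothetical and only illustrate the pattern. The prompt ends on an open `JSON:` slot so that the likeliest continuation is an object shaped like the examples.

```python
import json

# Hypothetical worked examples; in practice, use real inputs with outputs you have verified.
EXAMPLES = [
    ("The Aero kettle boils fast but the lid rattles.",
     {"name": "Aero kettle", "score": 3, "reason": "fast boil, loose lid"}),
    ("Nimbus headphones: flawless sound, all-day battery.",
     {"name": "Nimbus headphones", "score": 5, "reason": "great sound and battery"}),
]


def build_few_shot_prompt(examples, new_input):
    """Render input/output pairs, then the new input with its output left open."""
    parts = ["Extract a product rating from each review. Respond with JSON only."]
    for text, output in examples:
        parts.append(f"Review: {text}\nJSON: {json.dumps(output)}")
    # Finish on an open "JSON:" slot so the continuation follows the demonstrated shape.
    parts.append(f"Review: {new_input}\nJSON:")
    return "\n\n".join(parts)


def parse_reply(reply, required_keys=("name", "score", "reason")):
    """Parse the model's reply and check it has the keys the examples demonstrated."""
    data = json.loads(reply)
    missing = [key for key in required_keys if key not in data]
    if missing:
        raise ValueError(f"reply is missing keys: {missing}")
    return data


prompt = build_few_shot_prompt(EXAMPLES, "The Volt charger died after a week.")
print(prompt)
```

Validating the reply with something like `parse_reply` is still worthwhile: examples make the right format likely, not guaranteed.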
Token economics in production
Every token costs money and adds latency. At scale, prompt bloat becomes a real cost centre. A 2,000-token system prompt costs 4× as much on the input side as a 500-token one, and you pay for it on every single call.
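```python
import tiktoken

# Tokeniser for the target model, so counts approximate what is billed
enc = tiktoken.encoding_for_model("gpt-4o")

with open("system_prompt.txt", encoding="utf-8") as f:
    system_prompt = f.read()

avg_user_msg = 150      # tokens per user message (estimate)
avg_response = 400      # tokens per response (estimate)
calls_per_day = 10_000
days_per_month = 30

system_tokens = len(enc.encode(system_prompt))
input_tokens = (system_tokens + avg_user_msg) * calls_per_day * days_per_month
output_tokens = avg_response * calls_per_day * days_per_month

# Illustrative rates in USD per 1M tokens ($5 input, $15 output);
# provider prices change, so substitute current ones
input_price = 5.0
output_price = 15.0
monthly_cost = input_tokens / 1e6 * input_price + output_tokens / 1e6 * output_price

print(f"System prompt tokens: {system_tokens}")
print(f"Monthly estimate: ${monthly_cost:,.0f}")
```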
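To make the 500-versus-2,000-token comparison concrete, here is a small sketch that needs no tokeniser. It reuses the traffic assumptions from the estimate above (150-token user messages, 10,000 calls a day, 30 days) and an illustrative $5 per 1M input tokens. The `monthly_input_cost` helper is hypothetical.

```python
def monthly_input_cost(system_tokens, user_tokens, calls_per_day, price_per_million, days=30):
    """Input-side cost in dollars for one month of traffic."""
    total_tokens = (system_tokens + user_tokens) * calls_per_day * days
    return total_tokens / 1e6 * price_per_million


# Assumed values: traffic and message size mirror the estimate above;
# the $5 per 1M input tokens rate is illustrative, not a quoted price.
USER_TOKENS = 150
CALLS_PER_DAY = 10_000
INPUT_PRICE = 5.0

lean = monthly_input_cost(500, USER_TOKENS, CALLS_PER_DAY, INPUT_PRICE)
bloated = monthly_input_cost(2_000, USER_TOKENS, CALLS_PER_DAY, INPUT_PRICE)

print(f"500-token system prompt:   ${lean:,.0f}/month")
print(f"2,000-token system prompt: ${bloated:,.0f}/month")
print(f"Ratio: {bloated / lean:.2f}x, extra spend: ${bloated - lean:,.0f}/month")
```

Under these assumptions the input bill goes from $975 to $3,225 a month. The system prompt itself is 4× larger, but the overall input ratio is closer to 3.3× because the user message is billed either way.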
Practical techniques that actually work
- Put instructions before examples, not after: models often use information near the start of a long context more reliably than information buried in the middle
- Use XML tags to separate sections: <context>, <instructions>, <examples>. Clearly delimited sections are easier for the model to tell apart
- Be explicit about output format: "Respond in JSON with keys: name, score, reason" beats "give me structured output"
- Negative examples outperform negative instructions: show what you DON'T want, don't just describe it
- For classification, list every valid class — ambiguous cases default to whichever class sounds most probable in general text
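Several of these techniques can be combined in one prompt builder. The sketch below is illustrative only: the label set, the `build_classification_prompt` and `normalise_label` helpers, and the support-ticket task are hypothetical. It puts instructions first, wraps each section in tags, states the output format explicitly, and lists every valid class.

```python
VALID_CLASSES = ["billing", "technical", "account", "other"]  # hypothetical label set


def build_classification_prompt(ticket, examples):
    """Assemble a tagged prompt: instructions first, then examples, then the input."""
    example_lines = "\n".join(f"{text} -> {label}" for text, label in examples)
    return (
        "<instructions>\n"
        "Classify the support ticket into exactly one class.\n"
        f"Valid classes: {', '.join(VALID_CLASSES)}\n"
        "Respond in JSON with keys: label, reason\n"
        "</instructions>\n"
        f"<examples>\n{example_lines}\n</examples>\n"
        f"<context>\n{ticket}\n</context>"
    )


def normalise_label(raw_label):
    """Map a model-provided label onto the allowed set, falling back to 'other'."""
    label = raw_label.strip().lower()
    return label if label in VALID_CLASSES else "other"


prompt = build_classification_prompt(
    "I was charged twice this month.",
    [("App crashes on login", "technical"), ("Please update my email", "account")],
)
print(prompt)
```

Falling back to `other` for an unexpected label is one design choice; raising an error or retrying the call may suit a stricter pipeline better.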
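Prompt caching (Anthropic, OpenAI) lets you cache a static prefix and pay a reduced rate for those tokens on repeat calls. The discount depends on the provider and model: Anthropic bills cache reads at roughly 10% of the normal input price, other rates differ, and cache writes can carry a surcharge. If your system prompt is 1,000+ tokens and you're hitting the same model thousands of times per day, caching can cut your input costs by 80–90% when the cached prefix dominates each request and the cached rate is near 10%.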
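How much caching saves depends on how much of each request is cacheable and how often the cache is hit. The following sketch estimates it under stated assumptions: the `monthly_input_cost_with_cache` helper, the 10% cached rate, the hit rates, and the $5 per 1M token price are all illustrative, and cache-write surcharges are ignored.

```python
def monthly_input_cost_with_cache(
    prefix_tokens, dynamic_tokens, calls_per_month,
    price_per_million, cached_rate=0.10, hit_rate=1.0,
):
    """Estimate input cost when a static prefix is served from cache on some calls.

    cached_rate: fraction of the normal input price charged for cached tokens (assumed).
    hit_rate: fraction of calls that find the prefix in cache (assumed).
    """
    per_token = price_per_million / 1e6
    # Prefix tokens are billed at the discounted rate on hits and the full rate on misses.
    prefix_cost = prefix_tokens * per_token * (hit_rate * cached_rate + (1 - hit_rate))
    dynamic_cost = dynamic_tokens * per_token  # the per-call part is never cached
    return (prefix_cost + dynamic_cost) * calls_per_month


CALLS = 300_000  # 10,000 calls a day for 30 days
uncached = monthly_input_cost_with_cache(2_000, 150, CALLS, 5.0, hit_rate=0.0)
all_hits = monthly_input_cost_with_cache(2_000, 150, CALLS, 5.0, hit_rate=1.0)
mostly_hits = monthly_input_cost_with_cache(2_000, 150, CALLS, 5.0, hit_rate=0.9)

for name, cost in [("no cache", uncached), ("100% hits", all_hits), ("90% hits", mostly_hits)]:
    print(f"{name:>10}: ${cost:,.0f}/month, saving {1 - cost / uncached:.0%}")
```

With a 2,000-token prefix and a 150-token user message, the saving is roughly 84% at a perfect hit rate and about 75% at 90% hits. The smaller the prefix is relative to the rest of the input, the further the saving falls below the headline figure.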