Foundations & Architecture 9 min read

Prompt Engineering & Token Economics

How prompt structure affects quality, why few-shot beats zero-shot in most cases, and how to calculate real inference cost before you ship.

Prompt engineering is the most underrated skill in AI engineering. Not because it's glamorous, but because a well-structured prompt can halve your token count, double your accuracy, and save thousands of dollars a month in production.

How LLMs read your prompt

Your prompt is tokenised, embedded, and passed through every attention layer. The model has no special understanding of your intent — it's predicting the next token given everything before it. Structure matters because it shifts the probability distribution of what comes next.

The model sees your prompt as a sequence of tokens, not instructions. When you write "You are a helpful assistant", you are literally priming the distribution of likely next tokens — not flipping a switch labelled "helpful".

Zero-shot vs. few-shot vs. chain-of-thought

Technique	When to use	Token cost	Accuracy
Zero-shot	Simple, well-defined tasks	Lowest	OK for easy tasks
Few-shot	Classification, formatting, tone	Medium	Strong improvement
Chain-of-thought	Reasoning, math, multi-step	High	Best for complex tasks
CoT + few-shot	Hard reasoning with examples	Highest	Gold standard

Few-shot examples are the most reliable way to constrain output format. If you need JSON output with a specific schema, showing 2-3 examples in the prompt is more reliable than describing the schema in words.

Token economics in production

Every token costs money and adds latency. At scale, prompt bloat becomes a real cost centre. A system prompt that's 2,000 tokens instead of 500 tokens costs 4× more on the input side, and gets charged on every single call.

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

system_prompt = open("system_prompt.txt").read()
avg_user_msg  = 150   # tokens
avg_response  = 400   # tokens
calls_per_day = 10_000

system_tokens = len(enc.encode(system_prompt))
input_tokens  = (system_tokens + avg_user_msg) * calls_per_day * 30
output_tokens = avg_response * calls_per_day * 30

# GPT-4o pricing (as of 2025): $5/1M input, $15/1M output
monthly_cost = (input_tokens / 1e6 * 5) + (output_tokens / 1e6 * 15)
print(f"System prompt tokens: {system_tokens}")
print(f"Monthly estimate: ${monthly_cost:,.0f}")

Practical techniques that actually work

Put instructions before examples, not after — models weight earlier tokens more heavily in long contexts
Use XML tags to separate sections: <context>, <instructions>, <examples> — models parse structure reliably
Be explicit about output format: "Respond in JSON with keys: name, score, reason" beats "give me structured output"
Negative examples outperform negative instructions: show what you DON'T want, don't just describe it
For classification, list every valid class — ambiguous cases default to whichever class sounds most probable in general text

Prompt caching (Anthropic, OpenAI) lets you cache a static prefix and pay only 10% of the input cost on repeat calls. If your system prompt is 1,000+ tokens and you're hitting the same model thousands of times per day, caching alone can cut your input costs by 80–90%.

Compare prompting techniques live →: Run zero-shot vs. few-shot vs. CoT on the same task and see output quality differences in real time.

Try it interactively

GenAI Systems Lab is a free platform for AI engineers — configure real failure modes, break things, and build the judgment that gets you hired.

Open GenAI Systems Lab →