Temperature, Top-P, Top-K: How LLMs Actually Choose the Next Word

Greedy, beam search, nucleus sampling — what each one does, when randomness helps, and why temperature 0 isn't always the right answer.

Every time an LLM generates a word, it runs a lottery. The model assigns a probability to every token in its vocabulary — tens of thousands of options — and then picks one. The question is: how do you want that lottery to be rigged?

Temperature, top-K, and top-P are the three knobs that control this. Understanding them is the difference between outputs that feel robotically repetitive and outputs that feel creatively alive — or between outputs that are reliably correct and outputs that hallucinate.

How autoregressive generation works

LLMs generate text one token at a time. After processing your entire prompt, the model produces a probability distribution over its vocabulary for the next token. It samples one, appends it, then runs the whole thing again — repeating until it hits a stop token or your max_tokens limit.
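The loop described above can be sketched in a few lines. This is a toy illustration, not a real model: `toy_next_token_probs`, the `<eos>` stop token, and the probabilities are all made up for the example.

```python
import random

def toy_next_token_probs(tokens):
    """Hypothetical stand-in for a model: maps the tokens so far to {token: probability}."""
    if tokens[-1] == "is":
        return {"sunny": 0.5, "cold": 0.3, "nice": 0.2}
    # After any other token, this toy model always wants to stop.
    return {"<eos>": 1.0}

def generate(prompt_tokens, next_token_probs, max_tokens=10, rng=None):
    """Autoregressive loop: sample a token, append it, repeat."""
    rng = rng or random.Random()
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        probs = next_token_probs(tokens)
        candidates = list(probs)
        weights = [probs[t] for t in candidates]
        token = rng.choices(candidates, weights=weights, k=1)[0]
        if token == "<eos>":  # stop token ends generation early
            break
        tokens.append(token)
    return tokens

out = generate(["The", "weather", "today", "is"], toy_next_token_probs,
               rng=random.Random(0))
print(" ".join(out))
```

A real model recomputes the distribution from the full context at every step; the structure of the loop is the same.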

The raw outputs before softmax are called logits. Softmax converts logits into probabilities that sum to 1. Temperature acts on the logits before softmax; truncation strategies such as top-K and top-P act on the resulting probabilities.
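As a minimal sketch of that conversion, here is softmax in plain Python. The logit values are invented for illustration and do not come from any real model.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    # Subtracting the max logit avoids overflow and does not change the result.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for five candidate tokens (illustrative values only).
logits = [2.0, 1.4, 1.0, 0.8, 0.4]
probs = softmax(logits)
print([round(p, 3) for p in probs])
```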

Prompt: "The weather today is"

Token distribution (top 5):
  "sunny"     → 0.38
  "cold"      → 0.21
  "nice"      → 0.14
  "warm"      → 0.11
  "cloudy"    → 0.08
  ... (+ 99,995 other tokens with tiny probs)

Which one gets chosen? Depends on your sampling strategy.

Greedy decoding

The simplest strategy: always pick the highest-probability token. Deterministic, fast, reproducible.

The problem: greedy decoding is prone to degenerate repetition loops. Once the model commits to a high-probability path, each repetition makes the same continuation even more likely, so the output can get stuck restating a phrase or sentence. In the extreme case you get something like "The the the the the": a valid greedy output whenever the model keeps ranking the same token first.

Greedy decoding (temperature=0) is useful for: code generation, structured JSON output, factual Q&A where you want deterministic answers. It's a bad default for conversational AI.
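Greedy decoding is just an argmax over the distribution. A minimal sketch, using the top-5 probabilities from the weather example above (the remaining tokens are left out because they cannot win the argmax):

```python
def greedy_pick(probs):
    """Return the single most probable token; no randomness involved."""
    return max(probs, key=probs.get)

# Top-5 tokens from the example above; the long tail is omitted.
probs = {"sunny": 0.38, "cold": 0.21, "nice": 0.14, "warm": 0.11, "cloudy": 0.08}
print(greedy_pick(probs))  # prints "sunny"
```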

Temperature

Temperature divides all logits by a constant T before applying softmax. This reshapes the probability distribution:

T < 1.0 — sharpens the distribution. The most likely token gets even more probability mass. More deterministic, less creative.
T = 1.0 — no change. Raw model probabilities.
T > 1.0 — flattens the distribution. Low-probability tokens get relatively more likely. More random, more creative, also more likely to be incoherent.
T → 0 — approaches greedy decoding.
T → ∞ — approaches uniform random sampling from the entire vocabulary.
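To see the sharpening and flattening effect, here is a small sketch that divides logits by T before softmax. The logits are made-up values, and the function name is just for this example.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by T, then apply softmax."""
    if temperature <= 0:
        # T = 0 would divide by zero; treat it as greedy decoding instead.
        raise ValueError("temperature must be > 0")
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits (illustrative values only).
logits = [2.0, 1.4, 1.0, 0.8, 0.4]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}:", [round(p, 3) for p in probs])
```

Running it should show the top token's share growing at T=0.5 and shrinking at T=2.0 relative to T=1.0.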

Temperature	Feel	Good for
0.0	Deterministic, robotic	Code, structured output, evals
0.1–0.3	Focused, consistent	Factual Q&A, classification
0.4–0.7	Balanced	Most chat applications
0.8–1.0	Creative, varied	Writing, brainstorming
>1.0	Unpredictable	Experimental use only

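Temperature 0 does NOT mean "no creativity". It means "always the same answer". In principle, running the same prompt 100 times at temperature 0 gives 100 identical outputs; in practice, hosted APIs can still show small run-to-run differences caused by floating-point and batching effects. Useful for evals; a poor fit for conversations, where identical replies feel canned.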

Top-K sampling

Before sampling, truncate the distribution to only the K most probable tokens. Set everything else to zero, renormalise, then sample.

K=50 is a common default. This prevents the model from ever selecting one of the 99,950 garbage tokens at the bottom of the distribution, while still allowing variety among the top candidates.

The weakness: K is fixed regardless of the distribution's shape. K=50 applies whether the top token has 95% probability or 5%. When one token clearly dominates, keeping 50 candidates adds unnecessary randomness; when the distribution is flat, 50 may cut off reasonable options.
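The truncate-and-renormalise step can be sketched as follows. The distribution reuses the weather example, with the long tail lumped into a single `"other"` entry as a simplification for this sketch.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalise so they sum to 1."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}

# Weather example; "other" stands in for the whole tail (a simplification).
probs = {"sunny": 0.38, "cold": 0.21, "nice": 0.14,
         "warm": 0.11, "cloudy": 0.08, "other": 0.08}
filtered = top_k_filter(probs, 3)
print({tok: round(p, 3) for tok, p in filtered.items()})
```

Sampling then draws from `filtered` instead of the full distribution.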

Top-P (nucleus) sampling

Instead of a fixed K, take the smallest set of tokens whose cumulative probability adds up to at least P, then sample from that set.

P=0.9: rank tokens by probability, keep adding until you've covered 90% of the mass. If one token has 92% probability, nucleus size = 1. If the top 100 tokens each have 0.9% probability, nucleus size = 100.
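Here is a minimal sketch of nucleus selection. The two distributions are illustrative: one peaked (invented for this example) and the weather example with its tail lumped into a single `"other"` entry.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for tok, prob in ranked:
        nucleus.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in nucleus)
    return {tok: prob / total for tok, prob in nucleus}

# Peaked distribution (hypothetical): one token already covers 90% of the mass.
peaked = {"a": 0.92, "b": 0.05, "c": 0.03}
# Flatter distribution: the weather example, tail lumped into "other".
weather = {"sunny": 0.38, "cold": 0.21, "nice": 0.14,
           "warm": 0.11, "cloudy": 0.08, "other": 0.08}

print(len(top_p_filter(peaked, 0.9)))   # prints 1
print(len(top_p_filter(weather, 0.9)))  # prints 5
```

The same P gives a nucleus of one token in the first case and five in the second, which is the adaptive behaviour that a fixed K lacks.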

Top-P adapts to the shape of the distribution: the candidate set shrinks when the model is confident and grows when it is not. That is why it is often preferred over Top-K, and most hosted LLM APIs expose it as a parameter.

Most production LLM APIs use temperature + top-P together. Temperature controls "how concentrated" the distribution is. Top-P controls "from how wide a set we sample". A common production default: temperature=0.7, top_p=0.9.
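Putting the two together, one plausible ordering is: scale logits by temperature, apply softmax, cut to the nucleus, then sample. This is a sketch under that assumption; real implementations differ in ordering and details, and the logits and function name are hypothetical.

```python
import math
import random

def sample_next(logits, temperature=0.7, top_p=0.9, rng=None):
    """Sample one token from {token: logit} using temperature + top-P."""
    rng = rng or random.Random()
    if temperature <= 0:
        # Treat T = 0 as greedy decoding.
        return max(logits, key=logits.get)

    # 1. Temperature-scaled softmax (max subtracted for numerical stability).
    m = max(logits.values())
    exps = {t: math.exp((x - m) / temperature) for t, x in logits.items()}
    total = sum(exps.values())
    probs = {t: e / total for t, e in exps.items()}

    # 2. Nucleus: smallest top-ranked set whose cumulative mass reaches top_p.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for tok, prob in ranked:
        nucleus.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    # 3. Sample; random.choices treats the weights as relative, so no renormalising needed.
    tokens = [t for t, _ in nucleus]
    weights = [p for _, p in nucleus]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical logits (illustrative values only).
logits = {"sunny": 2.0, "cold": 1.4, "nice": 1.0, "warm": 0.8, "cloudy": 0.4}
print(sample_next(logits, rng=random.Random(42)))
```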

Which to use when

Use case	Temperature	Top-P	Notes
Code generation	0.0–0.2	0.95	Determinism matters; avoid garbage tokens
Factual Q&A / RAG	0.1–0.3	0.9	Consistent, grounded
Chat assistant	0.5–0.7	0.9	Balanced naturalness
Creative writing	0.8–1.0	0.95	Variety > consistency
Eval / benchmarking	0.0	1.0	Must be reproducible

I wasted two weeks thinking we had a hallucination problem. Turned out temperature was set to 1.2. The model wasn't lying — it was drunk.

Min-P sampling — the newer contender

Min-P (minimum probability) is a newer sampling strategy gaining traction in open-source communities. Instead of a fixed cutoff, it sets a threshold relative to the most probable token: if the top token has probability 0.8 and min_p=0.05, only tokens with probability ≥ 0.04 (5% of 0.8) are considered. This adapts gracefully across both high-confidence and low-confidence generation steps — better than Top-K in theory, and comparable to Top-P in practice.
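The relative threshold is easy to sketch. The distribution below is invented to match the numbers in the paragraph above (top token at 0.8, min_p=0.05, so the cutoff is 0.04).

```python
def min_p_filter(probs, min_p):
    """Keep tokens whose probability is at least min_p times the top probability."""
    threshold = min_p * max(probs.values())
    kept = {tok: p for tok, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

# Hypothetical distribution: top token at 0.8, so the cutoff is 0.05 * 0.8 = 0.04.
probs = {"a": 0.8, "b": 0.1, "c": 0.05, "d": 0.03, "e": 0.02}
filtered = min_p_filter(probs, 0.05)
print(sorted(filtered))  # prints ['a', 'b', 'c']
```

If the top token only had probability 0.2, the same min_p would put the cutoff at 0.01 and let more candidates through.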
