Temperature, Top-p, Top-k: The Sampling Decisions That Shape Every LLM Response
After the forward pass, a distribution. Before the output, a sampling choice. Temperature divides logits. Top-k keeps the top tokens. Top-p adapts to distribution shape. Implemented from scratch, compared on the same distribution, with production default recommendations for different tasks.
After the forward pass, an LLM produces a distribution over its entire vocabulary for the next token. Which token you select from that distribution — and how — determines the character of the output. A temperature change alone can turn a confident, deterministic response into a creative, surprising one. A top-p cutoff eliminates the long tail of improbable tokens. The decisions compound over hundreds of tokens. This post makes these mechanics precise with implementation and experiments.
Temperature: scale the logits before softmax
Logits are the raw scores the model produces before softmax. Temperature divides them before normalization: prob(token) = softmax(logit / T). With T=1 (the default), you get the model's unmodified distribution. With T<1, the distribution sharpens and the highest-probability token dominates. With T>1, the distribution flattens and tokens move toward equal probability. T=0 cannot be computed literally, because it would divide by zero; it is the limiting case and is implemented as greedy decoding: always take the argmax.
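import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability
    e = np.exp(x - x.max())
    return e / e.sum()

def greedy(logits):
    return np.argmax(logits)

def temperature_sample(logits, T=1.0):
    if T == 0:
        return greedy(logits)  # T -> 0 limit; avoids dividing by zero
    probs = softmax(logits / T)
    return np.random.choice(len(probs), p=probs)
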
def top_k_sample(logits, k=50, T=1.0):
    """Keep only the top-k logits; sample from those."""
    k = min(k, len(logits))  # guard against k larger than the vocabulary
    top_k_idx = np.argpartition(logits, -k)[-k:]
    filtered = np.full_like(logits, -np.inf)
    filtered[top_k_idx] = logits[top_k_idx]
    probs = softmax(filtered / T)
    return np.random.choice(len(probs), p=probs)

def top_p_sample(logits, p=0.9, T=1.0):
    """Nucleus sampling: keep the smallest set of tokens whose cumulative prob >= p."""
    probs = softmax(logits / T)
    sorted_idx = np.argsort(probs)[::-1]
    sorted_probs = probs[sorted_idx]
    cumsum = np.cumsum(sorted_probs)
    # Keep tokens up to and including the first one that brings the cumulative mass to p
    cutoff = np.searchsorted(cumsum, p) + 1
    filtered = np.full_like(logits, -np.inf)
    filtered[sorted_idx[:cutoff]] = logits[sorted_idx[:cutoff]]
    # Dividing by T again only renormalizes the tempered probabilities over the nucleus
    filtered_probs = softmax(filtered / T)
    return np.random.choice(len(filtered_probs), p=filtered_probs)

# ── Simulation: compare strategies on the same distribution ──────────────────
np.random.seed(42)
vocab_size = 20
# Skewed distribution: one token is strongly preferred (like 'the' in a sentence)
logits = np.random.randn(vocab_size) * 2
logits[0] = 5.0 # token 0 is clearly best
logits[1] = 3.0 # token 1 is a strong runner-up (a random draw can still land nearby)
print("Logits (top 5):", np.sort(logits)[::-1][:5].round(2))
print(f"Base probs (T=1, tokens 0-4): {softmax(logits)[:5].round(3)}\n")

def sample_n(strategy_fn, n=1000, **kwargs):
    counts = np.zeros(vocab_size)
    for _ in range(n):
        tok = strategy_fn(logits.copy(), **kwargs)
        counts[tok] += 1
    top3 = np.argsort(counts)[::-1][:3]
    return {int(i): int(counts[i]) for i in top3}

print(f"Greedy (always): token {greedy(logits)}")
print(f"T=0.1 (1000 samples, top tokens): {sample_n(temperature_sample, T=0.1)}")
print(f"T=1.0 (1000 samples, top tokens): {sample_n(temperature_sample, T=1.0)}")
print(f"T=2.0 (1000 samples, top tokens): {sample_n(temperature_sample, T=2.0)}")
print(f"top_k=5 (1000 samples): {sample_n(top_k_sample, k=5)}")
print(f"top_p=0.9 (1000 samples): {sample_n(top_p_sample, p=0.9)}")
Top-k: fixed nucleus size
Top-k sampling keeps only the k highest-probability tokens and samples from them. The problem: k is fixed regardless of the shape of the distribution. If the model is very confident (one token at 95% probability), top-k=50 still admits 49 near-zero-probability tokens, which adds noise. If the model is very uncertain (a near-uniform distribution), top-k=50 cuts off most of the reasonable options. k=50 is a common choice, but it is an awkward hyperparameter because it does not adapt to the distribution's shape.
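To put rough numbers on both failure modes, here is a minimal sketch using two made-up distributions over a hypothetical 1000-token vocabulary: one with a single token at 95%, one perfectly uniform. The vocabulary size and the helper name `top_k_mass` are illustrative assumptions, not part of the implementation above.

```python
def top_k_mass(probs, k):
    """Return the total probability held by the k most likely tokens."""
    return sum(sorted(probs, reverse=True)[:k])

vocab = 1000  # hypothetical vocabulary size
k = 50

# Confident case: one token at 95%, the remaining 5% spread evenly over the rest.
peaked = [0.95] + [0.05 / (vocab - 1)] * (vocab - 1)
# Uncertain case: every token equally likely.
flat = [1.0 / vocab] * vocab

# Mass that top-k hands to the 49 tail tokens in the confident case.
tail_kept_peaked = top_k_mass(peaked, k) - peaked[0]
# Mass that top-k throws away in the uncertain case.
mass_dropped_flat = 1.0 - top_k_mass(flat, k)

print(f"peaked: {k - 1} tail tokens kept, sharing {tail_kept_peaked:.4f} of the mass")
print(f"flat:   {vocab - k} tokens dropped, holding {mass_dropped_flat:.2f} of the mass")
```

In this toy setup the same k=50 keeps 49 tokens the model barely considers in the confident case and discards about 95% of the probability mass in the uniform case.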
Top-p (nucleus sampling): adaptive nucleus
Top-p keeps the smallest set of tokens whose cumulative probability reaches p. With p=0.9, when the model is confident (one token at 95%), the nucleus has 1 token. When the model is uncertain, the nucleus expands to include many tokens. The cutoff adapts to the distribution's shape automatically, which is why top-p is generally preferred over top-k for text generation.
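The adaptive behaviour is easy to check by counting how many tokens land in the nucleus. The sketch below assumes a hypothetical 1024-token vocabulary (a power of two keeps the uniform probabilities exact in floating point) and a helper named `nucleus_size`; both are illustrative choices.

```python
def nucleus_size(probs, p):
    """Count the tokens in the smallest set whose cumulative probability reaches p."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return count
    return len(probs)  # floating-point shortfall: keep every token

vocab = 1024  # hypothetical vocabulary size

# Confident case: one token at 95%, the rest share the remaining 5%.
peaked = [0.95] + [0.05 / (vocab - 1)] * (vocab - 1)
# Uncertain case: uniform over the vocabulary.
flat = [1.0 / vocab] * vocab

size_peaked = nucleus_size(peaked, 0.9)
size_flat = nucleus_size(flat, 0.9)

print("nucleus size, confident model:", size_peaked)
print("nucleus size, uncertain model:", size_flat)
```

The same p=0.9 yields a one-token nucleus for the confident distribution and a nucleus covering roughly 90% of the vocabulary for the uniform one.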
Production defaults and when to change them
Many production setups use temperature=0.7-1.0 and top-p=0.9-0.95 with no top-k, though defaults vary by provider, so check the documentation for the API you call. For factual question answering (RAG, structured extraction), use a low temperature (0.1-0.3) or greedy decoding (T=0). For creative tasks (copy, brainstorming), use temperature=0.8-1.2 with top-p=0.9. For code generation, use temperature=0.2-0.5: the output needs syntactic correctness but benefits from some exploration. The interaction matters: if temperature is very low, top-p has little effect because the distribution is already peaked. Temperature is the primary control; top-p is a safety net.
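The temperature and top-p interaction can be seen by holding p fixed at 0.9 and counting the nucleus size as T changes. This is a sketch on an invented 10-token logit vector (`toy_logits` is an assumption for illustration, not output from a real model).

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a plain list of floats."""
    scaled = [x / T for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_size(probs, p):
    """Count the tokens in the smallest set whose cumulative probability reaches p."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return count
    return len(probs)

# Invented logits: one clear favourite followed by a gradual tail.
toy_logits = [5.0, 3.0, 2.0, 1.0, 0.5, 0.0, -0.5, -1.0, -1.5, -2.0]

temperatures = (0.2, 0.7, 1.0, 1.5)
sizes = {T: nucleus_size(softmax(toy_logits, T), 0.9) for T in temperatures}

for T in temperatures:
    print(f"T={T}: nucleus at p=0.9 holds {sizes[T]} token(s)")
```

At low temperature the nucleus collapses to a single token, so the top-p filter has nothing left to remove; it only starts trimming once the temperature is high enough to spread mass into the tail.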
Repetition penalty lowers the scores of tokens that have already been generated, reducing the chance that the model repeats itself. It is useful for long generation tasks but can cause incoherence if set too high. min-p (a newer alternative to top-p) keeps tokens whose probability is at least p_min × max_prob, so the threshold scales with the model's confidence. This tends to make it more robust than top-p on extreme distributions.
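Both ideas fit in a few lines. The sketch below is one possible reading, not a reference implementation: `min_p_filter` follows the p_min × max_prob rule stated above, while `apply_repetition_penalty` assumes a simple subtractive penalty on the logits. Libraries implement repetition penalties in different ways, so treat the function names, the subtraction, and the example values as assumptions.

```python
def min_p_filter(probs, p_min):
    """Zero out tokens whose probability is below p_min * max(probs), then renormalize."""
    threshold = p_min * max(probs)
    kept = [q if q >= threshold else 0.0 for q in probs]
    total = sum(kept)
    return [q / total for q in kept]

def apply_repetition_penalty(logits, generated_ids, penalty):
    """Subtract a flat penalty from the logit of each token that was already generated.

    Assumption: a subtractive variant, applied once per distinct token.
    """
    penalized = list(logits)
    for token_id in set(generated_ids):
        penalized[token_id] -= penalty
    return penalized

# min-p: the threshold is 0.1 * 0.6 = 0.06, so only the 0.03 token is dropped.
filtered = min_p_filter([0.6, 0.3, 0.07, 0.03], p_min=0.1)
print("min-p filtered:", [round(q, 3) for q in filtered])

# Repetition penalty: tokens 0 and 2 were already generated, so their logits drop by 1.5.
penalized = apply_repetition_penalty([2.0, 1.0, 0.5], generated_ids=[0, 0, 2], penalty=1.5)
print("penalized logits:", penalized)
```

Because the min-p threshold is relative to the top token, a confident distribution prunes aggressively while a flat one keeps most candidates.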
Run an experiment: generate 20 completions of the same prompt with T=0.1 vs T=1.5. At T=0.1, completions should be nearly identical. At T=1.5, they should vary substantially — and some should be incoherent. This gives you direct intuition for how temperature controls the creativity/reliability tradeoff, which is the decision you make every time you set generation parameters for a production system.
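If you do not have a model at hand, a toy stand-in gives a feel for the same effect. The sketch below is not an LLM: it assumes a fixed, invented logit vector reused at every position (a real model's distribution depends on the tokens generated so far), and it counts how many distinct 8-token "completions" appear in 20 draws at each temperature. All names and values are hypothetical.

```python
import math
import random

def softmax(logits, T):
    """Temperature-scaled softmax over a plain list of floats."""
    scaled = [x / T for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def toy_completion(logits, T, length, rng):
    """Sample `length` token ids independently from the same tempered distribution."""
    probs = softmax(logits, T)
    return tuple(rng.choices(range(len(logits)), weights=probs, k=length))

def distinct_completions(logits, T, n=20, length=8, seed=0):
    """Generate n toy completions and count how many are different from each other."""
    rng = random.Random(seed)
    return len({toy_completion(logits, T, length, rng) for _ in range(n)})

# Invented logits standing in for a model's next-token scores.
toy_logits = [5.0, 3.0, 2.0, 1.0, 0.5, 0.0, -0.5, -1.0, -1.5, -2.0]

distinct_low = distinct_completions(toy_logits, T=0.1)
distinct_high = distinct_completions(toy_logits, T=1.5)

print(f"T=0.1: {distinct_low} distinct completion(s) out of 20")
print(f"T=1.5: {distinct_high} distinct completion(s) out of 20")
```

The toy cannot show incoherence, since it has no notion of language, but it does show the collapse toward a single output at low temperature and the spread at high temperature.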