Temperature, Top-p, Top-k: The Sampling Decisions That Shape Every LLM Response

After the forward pass, a distribution. Before the output, a sampling choice. Temperature divides logits. Top-k keeps the top tokens. Top-p adapts to distribution shape. Implemented from scratch, compared on the same distribution, with production default recommendations for different tasks.

After the forward pass, an LLM produces a distribution over its entire vocabulary for the next token. Which token you select from that distribution — and how — determines the character of the output. A temperature change alone can turn a confident, deterministic response into a creative, surprising one. A top-p cutoff eliminates the long tail of improbable tokens. The decisions compound over hundreds of tokens. This post makes these mechanics precise with implementation and experiments.

Temperature: scale the logits before softmax

Logits are the raw scores the model produces before softmax. Temperature divides them: prob(token) = softmax(logit / T). With T=1 (the default), you get the model's unmodified distribution. With T<1, the distribution sharpens and the highest-probability token dominates. With T>1, the distribution flattens and the probabilities move closer to uniform. T=0 cannot be computed literally because it divides by zero; it is the limit in which all probability mass lands on the argmax, so implementations treat it as greedy decoding.
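To make the effect of T concrete, here is a minimal sketch that applies three temperatures to the same three logits. The helper name `softmax_with_temperature` and the values in `toy_logits` are illustrative assumptions, not taken from a real model, and the sketch uses only the standard library so it runs on its own.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Return softmax(logits / T) as a list of probabilities (requires T > 0)."""
    scaled = [x / T for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits, not from a real model
toy_logits = [2.0, 1.0, 0.0]

for T in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, T)
    print(f"T={T}: {[round(p, 3) for p in probs]}")
```

The top token's share should shrink as T grows, while the ranking of the tokens stays the same: temperature reshapes the distribution but never reorders it.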

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def greedy(logits):
    return np.argmax(logits)

def temperature_sample(logits, T=1.0):
    # T=0 would divide by zero; treat it as the greedy limit instead
    if T <= 0:
        return greedy(logits)
    probs = softmax(logits / T)
    return np.random.choice(len(probs), p=probs)

def top_k_sample(logits, k=50, T=1.0):
    """Keep only the top-k logits; sample from those."""
    k = min(k, len(logits))  # argpartition raises if k exceeds the vocabulary size
    top_k_idx = np.argpartition(logits, -k)[-k:]
    filtered = np.full_like(logits, -np.inf)
    filtered[top_k_idx] = logits[top_k_idx]
    probs = softmax(filtered / T)
    return np.random.choice(len(probs), p=probs)

def top_p_sample(logits, p=0.9, T=1.0):
    """Nucleus sampling: keep the smallest set of tokens whose cumulative prob >= p."""
    probs = softmax(logits / T)
    sorted_idx = np.argsort(probs)[::-1]
    sorted_probs = probs[sorted_idx]
    cumsum = np.cumsum(sorted_probs)
    # Smallest prefix of the sorted tokens whose cumulative probability reaches p
    cutoff = np.searchsorted(cumsum, p) + 1
    filtered = np.full_like(logits, -np.inf)
    filtered[sorted_idx[:cutoff]] = logits[sorted_idx[:cutoff]]
    filtered_probs = softmax(filtered / T)
    return np.random.choice(len(filtered_probs), p=filtered_probs)

# ── Simulation: compare strategies on the same distribution ──────────────────
np.random.seed(42)
vocab_size = 20
# Skewed distribution: one token is strongly preferred (like 'the' in a sentence)
logits = np.random.randn(vocab_size) * 2
logits[0] = 5.0    # token 0 is clearly best
logits[1] = 3.0    # token 1 is second

print("Logits (top 5):", np.sort(logits)[::-1][:5].round(2))
print(f"Base probs (T=1): {softmax(logits)[:5].round(3)}\n")

def sample_n(strategy_fn, n=1000, **kwargs):
    counts = np.zeros(vocab_size)
    for _ in range(n):
        tok = strategy_fn(logits.copy(), **kwargs)
        counts[tok] += 1
    top3 = np.argsort(counts)[::-1][:3]
    return {int(i): int(counts[i]) for i in top3}

print(f"Greedy (always):                   token {greedy(logits)}")
print(f"T=0.1 (1000 samples, top tokens):  {sample_n(temperature_sample, T=0.1)}")
print(f"T=1.0 (1000 samples, top tokens):  {sample_n(temperature_sample, T=1.0)}")
print(f"T=2.0 (1000 samples, top tokens):  {sample_n(temperature_sample, T=2.0)}")
print(f"top_k=5 (1000 samples):            {sample_n(top_k_sample, k=5)}")
print(f"top_p=0.9 (1000 samples):          {sample_n(top_p_sample, p=0.9)}")

Top-k: fixed nucleus size

Top-k sampling keeps only the k highest-probability tokens and samples from them. The problem: k is a fixed number regardless of the shape of the distribution. If the model is very confident (one token at 95% probability), top-k=50 includes 49 near-zero-probability tokens, creating noise. If the model is very uncertain (uniform distribution), top-k=50 cuts off most reasonable options. k=50 is commonly used but it is an awkward hyperparameter because it does not adapt to distribution shape.
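The two failure cases above can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: `top_k_filter` is a hypothetical helper that works on probabilities rather than logits, and the `peaked` and `flat` distributions are made up to mirror the examples in the text.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize (hypothetical helper).

    Returns the renormalized {token_id: prob} dict and the original mass kept.
    """
    k = min(k, len(probs))
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept_mass = sum(probs[i] for i in keep)
    return {i: probs[i] / kept_mass for i in keep}, kept_mass

# Confident model: one token at 95%, the other 99 tokens share the remaining 5%
peaked = [0.95] + [0.05 / 99] * 99
# Uncertain model: 100 equally likely tokens
flat = [1 / 100] * 100

peaked_kept, peaked_mass = top_k_filter(peaked, 50)
flat_kept, flat_mass = top_k_filter(flat, 50)

# Share of the renormalized distribution held by the 49 tail tokens
tail_share = 1 - peaked_kept[0]
print(f"peaked: kept {len(peaked_kept)} tokens, tail share {tail_share:.3f}")
print(f"flat:   kept {len(flat_kept)} tokens, discarded mass {1 - flat_mass:.2f}")
```

In the peaked case the 49 extra tokens still hold a small share of the sampling mass; in the flat case k=50 throws away half of the equally plausible options.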

Top-p (nucleus sampling): adaptive nucleus

Top-p keeps the smallest set of tokens whose cumulative probability reaches p. With p=0.9, when the model is confident (one token at 95%), the nucleus has 1 token. When the model is uncertain, the nucleus expands to include many tokens. The candidate set adapts to the distribution shape automatically, which is why top-p is generally preferred over top-k for text generation.
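A short sketch of how the nucleus size responds to distribution shape. The helper `nucleus_size` and the two example distributions are assumptions made for illustration; the flat distribution uses 64 tokens so that the cumulative sums are exact in floating point.

```python
def nucleus_size(probs, p=0.9):
    """Number of tokens in the smallest set whose cumulative probability reaches p."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return count
    return len(probs)  # floating-point shortfall: keep every token

# Confident model: one token at 95%, the other 63 share the remaining 5%
peaked = [0.95] + [0.05 / 63] * 63
# Uncertain model: 64 equally likely tokens
flat = [1 / 64] * 64

print("peaked nucleus:", nucleus_size(peaked, p=0.9))  # 1 token
print("flat nucleus:  ", nucleus_size(flat, p=0.9))    # 58 tokens
```

The same p yields a single candidate when the model is confident and most of the vocabulary when it is not, which is the adaptivity that a fixed k lacks.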

Production defaults and when to change them

Common starting points are temperature=0.7-1.0 and top-p=0.9-0.95 with no top-k; actual defaults vary by provider, so check the API documentation. For factual question answering (RAG, structured extraction), use low temperature (0.1-0.3) or greedy decoding (T=0). For creative tasks (copy, brainstorming), use temperature=0.8-1.2 with top-p=0.9. For code generation, use temperature=0.2-0.5: low enough to keep the syntax correct, with some room for exploration. The interaction matters: if temperature is very low, top-p has little effect because the distribution is already so peaked that the nucleus contains only one or two tokens. Temperature is the primary control; top-p is a safety net.
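The interaction between temperature and top-p can be seen by counting how many tokens survive a p=0.9 cutoff at different temperatures. This is a sketch on made-up logits; `tempered_probs` and `nucleus_size` are hypothetical helpers, not part of any library.

```python
import math

def tempered_probs(logits, T):
    """softmax(logits / T) using only the standard library (requires T > 0)."""
    scaled = [x / T for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_size(probs, p=0.9):
    """Number of tokens in the smallest set whose cumulative probability reaches p."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return count
    return len(probs)

# Illustrative logits, not from a real model
toy_logits = [5.0, 3.0, 2.0, 1.0, 0.0, -1.0]

for T in (0.2, 1.0, 2.0):
    size = nucleus_size(tempered_probs(toy_logits, T), p=0.9)
    print(f"T={T}: top-p=0.9 keeps {size} of {len(toy_logits)} tokens")
```

On these logits the cutoff keeps 1 token at T=0.2, 2 at T=1.0, and 4 at T=2.0. At low temperature top-p has nothing left to trim, so it only starts doing work once the temperature is high enough to spread the mass.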

Two related controls are worth knowing. A repetition penalty lowers the scores of tokens that have already been generated, reducing the probability that the model repeats itself. It is useful for long generation tasks but can cause incoherence if set too high. min-p, a newer alternative to top-p, keeps tokens whose probability is at least p_min × max_prob. Because its threshold scales with the top token's probability, it is more robust than top-p on extreme distributions, whether very peaked or very flat.
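Both ideas fit in a few lines. The sketch below makes two assumptions: the repetition penalty uses one common multiplicative formulation (divide positive logits by the penalty, multiply negative ones), since the text does not say which variant it has in mind, and the function names `apply_repetition_penalty` and `min_p_filter` are hypothetical.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Lower the logits of already-generated tokens (assumed multiplicative variant)."""
    out = list(logits)
    for i in set(generated_ids):
        # Dividing a positive logit or multiplying a negative one both reduce it
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out

def min_p_filter(probs, p_min=0.1):
    """Keep tokens with prob >= p_min * max_prob, then renormalize."""
    threshold = p_min * max(probs)
    kept = {i: q for i, q in enumerate(probs) if q >= threshold}
    total = sum(kept.values())
    return {i: q / total for i, q in kept.items()}

# Tokens 0 and 1 were already generated, so their logits are pushed down
penalized = apply_repetition_penalty([2.0, -1.0, 0.5], generated_ids=[0, 1], penalty=2.0)
print(penalized)  # [1.0, -2.0, 0.5]

# Threshold is about 0.06 here, so the 0.03 token is dropped
filtered = min_p_filter([0.6, 0.3, 0.07, 0.03], p_min=0.1)
print(sorted(filtered))  # [0, 1, 2]
```

Note that the min-p threshold moves with the top probability: when the model is confident the bar is high and few tokens pass, and when the distribution is flat the bar drops and many tokens pass.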

Run an experiment: generate 20 completions of the same prompt with T=0.1 vs T=1.5. At T=0.1, completions should be nearly identical. At T=1.5, they should vary substantially — and some should be incoherent. This gives you direct intuition for how temperature controls the creativity/reliability tradeoff, which is the decision you make every time you set generation parameters for a production system.
