How Surprised Is the Model? Cross-Entropy, Entropy, and KL Divergence

One question unifies all of LLM training: how surprised was the model at the correct answer? Probability, entropy, cross-entropy loss, and KL divergence built from first principles for engineers who make fine-tuning and RLHF decisions.

One question unifies all of training

The loss functions used to train language models all answer one question: how surprised was the model when it saw the correct answer? If the model was confident and correct, it is barely surprised and the loss is low. If it was confident and wrong, it is extremely surprised and the loss is high. With this framing, cross-entropy loss, entropy, and KL divergence all click into place as variations of the same idea.

Probability: what the model predicts

At each token position, a language model outputs a probability distribution over its entire vocabulary — say, 100,000 tokens. Each entry is the model's estimated probability that this token is the correct next one. The distribution sums to 1. If the model assigns 0.9 to 'Paris' as the next token after 'The capital of France is', it is very confident. If it assigns 0.3, it is uncertain.
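
As a rough illustration of how such a distribution arises, the sketch below applies a softmax to raw scores (logits). The four-token vocabulary and the logit values are invented for the example; a real model does the same thing over its full vocabulary.

```python
import math

def softmax(logits):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 4-token vocabulary with made-up logits, for illustration only.
vocab = ["Paris", "Lyon", "London", "banana"]
logits = [5.0, 2.0, 1.5, -3.0]
probs = softmax(logits)

for token, p in zip(vocab, probs):
    print(f"{token:>7}: {p:.3f}")
```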

This probability maps directly onto surprise in the information-theoretic sense: the surprisal of an outcome is -log(p), so high probability means low surprise and low loss. The size of the weight update follows the same pattern. A model that assigned 0.9 to the right answer receives a small gradient and barely updates its weights; a model that assigned 0.1 receives a large one and updates heavily.
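
To make the link between surprise and update size concrete, here is a minimal sketch. It assumes the probabilities come from a softmax and the loss is -log of the correct token's probability, in which case the gradient with respect to the logits is the predicted distribution minus the one-hot target. The two distributions are made up.

```python
import math

def surprisal(p):
    """Surprise, in nats, at an outcome the model gave probability p."""
    return -math.log(p)

def logit_gradient(probs, correct_index):
    """Gradient of -log(probs[correct_index]) with respect to the logits,
    assuming probs came from a softmax: probs minus the one-hot target."""
    return [p - (1.0 if i == correct_index else 0.0)
            for i, p in enumerate(probs)]

# Made-up distributions over a 3-token vocabulary; index 0 is the correct token.
confident = [0.9, 0.05, 0.05]
unsure = [0.1, 0.45, 0.45]

for name, probs in (("confident", confident), ("unsure", unsure)):
    grad = logit_gradient(probs, correct_index=0)
    print(name, round(surprisal(probs[0]), 3), [round(g, 2) for g in grad])
```

The gradient on the correct token's logit has magnitude 1 - p_correct, so it shrinks toward zero as the model becomes confident in the right answer.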

Entropy: how uncertain is the distribution?

Entropy measures the average uncertainty of a probability distribution. A distribution that puts all probability on one token has entropy 0: it is certain. A distribution that spreads probability evenly across all 100,000 tokens has maximum entropy, ln(100,000) ≈ 11.5 nats or about 16.6 bits: it has no idea.
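
The two extremes can be checked numerically. This is a small sketch using natural logarithms, so results are in nats; the 100,000-token vocabulary size is the illustrative figure used in this article.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return sum(p * math.log(1.0 / p) for p in probs if p > 0)

vocab_size = 100_000
certain = [1.0] + [0.0] * (vocab_size - 1)  # all probability on one token
uniform = [1.0 / vocab_size] * vocab_size   # probability spread evenly

print(entropy(certain))                         # no uncertainty at all
print(entropy(uniform), math.log(vocab_size))   # the maximum for this vocabulary
```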

For language models, entropy matters in two places. First, during generation: high entropy at a position means the model is genuinely uncertain about what comes next, and this is where the sampling temperature has the most effect. Second, as a training diagnostic: if entropy stays very high throughout training, the model is not learning structure. If it drops to near zero, the model may be memorizing rather than generalizing.
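
One way to see the interaction with temperature is to divide the logits by a temperature before the softmax and measure the entropy of the result. The sketch below uses made-up logits and the common convention of dividing logits by the temperature; it is an illustration, not a description of any particular sampler.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a sampling temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return sum(p * math.log(1.0 / p) for p in probs if p > 0)

logits = [4.0, 2.0, 1.0, 0.5]  # made-up scores for four candidate tokens

entropies = {t: entropy(softmax(logits, t)) for t in (0.5, 1.0, 2.0)}
for t, h in entropies.items():
    print(f"temperature={t}: entropy={h:.3f} nats")
```

Lower temperatures sharpen the distribution and reduce entropy; higher temperatures flatten it and raise entropy.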

Entropy as eval signal: compute the average entropy of model outputs on your task. A well-calibrated model has moderate entropy on hard questions and low entropy on easy ones. A model with uniformly low entropy on hard questions is overconfident, and its confident answers are more likely to be hallucinations.

Cross-entropy loss: how surprised at the correct answer?

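Cross-entropy loss is the core training signal for language models. For each token position, it measures: given the model's probability distribution, how much information does it take to encode the actual correct token? With a base-2 logarithm the answer is in bits; with the natural logarithm that training code normally uses, it is in nats.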

The formula is: L = -log(p_correct), where log is the natural logarithm and p_correct is the probability the model assigned to the right answer. If p_correct = 1.0 (perfect confidence, correct), loss = 0. If p_correct = 0.5, loss ≈ 0.69. If p_correct = 0.01, loss ≈ 4.6. The logarithm punishes confident wrong predictions disproportionately: the loss grows without bound as p_correct approaches 0, which is exactly the pressure that rewards honest, calibrated probabilities.
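
The numbers above can be reproduced in a few lines. This sketch uses the natural logarithm, so losses are in nats, and adds a conversion to bits for comparison; the helper names are chosen for this example.

```python
import math

def token_loss(p_correct):
    """Cross-entropy loss in nats for one token: -ln(p_correct)."""
    return -math.log(p_correct)

def nats_to_bits(nats):
    """Convert a quantity measured in nats to bits."""
    return nats / math.log(2)

for p in (0.9, 0.5, 0.01):
    loss = token_loss(p)
    print(f"p_correct={p}: {loss:.2f} nats, {nats_to_bits(loss):.2f} bits")
```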

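Perplexity is just exponentiated average cross-entropy loss: PPL = exp(avg_loss), with the loss measured in nats. A perplexity of 10 means the model is, on average, as uncertain as if it were choosing uniformly among 10 equally likely options. A perplexity of 1 is perfect prediction. For scale, the GPT-2 paper reports zero-shot perplexities of roughly 29 on WikiText-2 for its smallest model and roughly 17.5 on WikiText-103 for its largest. GPT-4 is sometimes estimated at below 5 on standard benchmarks, but OpenAI has not published perplexity figures for it, and perplexities are only comparable when computed on the same dataset with the same token normalization.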
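
A minimal sketch of the perplexity calculation, taking as input the probabilities a model assigned to each correct token in a sequence. The input values are invented for illustration.

```python
import math

def perplexity(correct_token_probs):
    """exp of the average per-token cross-entropy loss (in nats)."""
    losses = [-math.log(p) for p in correct_token_probs]
    avg_loss = sum(losses) / len(losses)
    return math.exp(avg_loss)

# A model that always gives the correct token probability 0.1 behaves like
# a uniform choice among 10 options.
always_one_in_ten = [0.1] * 20
mixed = [0.9, 0.5, 0.01, 0.7]  # made-up per-token probabilities

print(perplexity(always_one_in_ten))
print(perplexity(mixed))
```

Because the average is taken over log-probabilities, a single very unlikely token (0.01 in the second example) raises perplexity far more than a single confident token lowers it.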

KL divergence: how far is one distribution from another?

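KL divergence measures how different one probability distribution is from another. KL(P || Q) answers: if the true distribution is P, how much extra information (in bits or nats) do you need to encode samples using a code built for Q instead of P? It is zero only when the two distributions match, and it is not symmetric: KL(P || Q) generally differs from KL(Q || P).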
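
The "extra information" reading can be verified directly: the cross-entropy of Q under P equals the entropy of P plus KL(P || Q). The sketch below checks that identity on two made-up distributions and assumes Q is nonzero wherever P is.

```python
import math

def entropy(p):
    """Entropy of P in nats."""
    return sum(pi * math.log(1.0 / pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Expected cost, in nats, of encoding samples from P with a code for Q."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(P || Q) in nats. Assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]  # made-up "true" distribution
q = [0.4, 0.4, 0.2]  # made-up approximation

print(kl_divergence(p, q), kl_divergence(q, p))       # the two directions differ
print(cross_entropy(p, q), entropy(p) + kl_divergence(p, q))
```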

In LLM contexts, KL divergence appears in two critical places. First, in RLHF and DPO. In PPO-style RLHF, the KL penalty term KL(policy || reference) keeps the policy model from drifting too far from the reference model, typically the supervised fine-tuned checkpoint that training started from, during reward optimization. DPO has no explicit penalty term, but its loss is derived from the same KL-constrained objective and its β plays the same role. Without this constraint, the model exploits the reward signal in degenerate ways, generating text that maximizes reward but looks nothing like natural language. The KL term is the leash.

Second, in knowledge distillation: when training a small student model from a large teacher, you minimize KL(teacher_distribution || student_distribution) rather than only cross-entropy on the labels (in practice the two are often combined). This forces the student to match the teacher's full probability distribution, including its uncertainty, not just its top-1 predictions. The student learns 'I'm 60% confident, not 99%' rather than just 'the answer is Paris.'
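
The difference between the two objectives shows up in a small example. In this sketch, with invented probabilities, an overconfident student looks excellent under hard-label cross-entropy but is clearly penalized by the KL term against the teacher. Real distillation setups often add a softening temperature, which is omitted here.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in nats. Assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def hard_label_loss(student_probs, label_index):
    """Cross-entropy against a single correct label."""
    return -math.log(student_probs[label_index])

# Made-up distributions over three candidate tokens; index 0 is the label.
teacher = [0.60, 0.30, 0.10]
matched = [0.60, 0.30, 0.10]          # student that copies the teacher
overconfident = [0.99, 0.005, 0.005]  # student that only learned the top-1 answer
label = 0

for name, student in (("matched", matched), ("overconfident", overconfident)):
    print(name,
          round(hard_label_loss(student, label), 3),
          round(kl_divergence(teacher, student), 3))
```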

Practical implication for RLHF: the KL coefficient β controls the tradeoff between reward maximization and staying close to the reference policy. High β = conservative, low β = aggressive. Setting β too low invites reward hacking. Setting it too high means the model barely changes from the reference. β = 0.1 is a common starting point, but the right value depends on the reward scale and the algorithm, so treat it as something to tune.
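
A minimal sketch of how β enters a KL-penalized reward, assuming the common single-sample estimate in which the KL term is the summed log-probability ratio of the sampled tokens under the policy and the reference. The function name, log-probabilities, and reward value are hypothetical, and production implementations differ in detail (for example, applying the penalty per token).

```python
def kl_penalized_reward(reward, policy_logprobs, reference_logprobs, beta):
    """Reward minus beta times a single-sample KL estimate.

    The estimate is the summed log-probability ratio of the sampled tokens
    under the policy versus the reference model.
    """
    log_ratio = sum(lp - lr
                    for lp, lr in zip(policy_logprobs, reference_logprobs))
    return reward - beta * log_ratio

# Made-up per-token log-probabilities for one sampled response.
reference_logprobs = [-2.0, -1.5, -3.0]
policy_logprobs = [-0.5, -0.4, -0.6]  # the policy has drifted toward this response
raw_reward = 1.0                      # hypothetical reward-model score

for beta in (0.01, 0.1, 1.0):
    shaped = kl_penalized_reward(raw_reward, policy_logprobs,
                                 reference_logprobs, beta)
    print(f"beta={beta}: shaped reward={shaped:.2f}")
```

With a small β the drift costs almost nothing; with a large β the same response ends up with a negative shaped reward, which is the conservative end of the tradeoff.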

Why this matters for engineering decisions

Fine-tuning: cross-entropy loss on your target domain tells you directly how surprised the model is by your data. High loss means large updates. Watch it drop, but watch for overfitting: holdout loss stops dropping while training loss keeps falling.
RLHF reward hacking: when the policy model collapses to gibberish that scores high on the reward model, the KL penalty was too low. Increase β.
Calibration: a well-calibrated model's predicted probabilities match actual accuracy rates. If it says 0.9 confidence, it should be right 90% of the time. Use temperature scaling after fine-tuning to recalibrate if needed.
Distillation vs hard labels: for small model training, soft targets (the full teacher distribution) often outperform hard labels, with gains on the order of 2–5 points on downstream tasks, though the margin varies with the task and the size gap between teacher and student. The uncertainty signal is information the labels throw away.
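
For the calibration point in the list above, one common way to quantify the gap between stated confidence and actual accuracy is expected calibration error (ECE). The sketch below is a simple equal-width-bin version with made-up predictions; the function name and bin count are choices for this example, not a fixed standard.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy across bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece

# Made-up predictions: both models claim 0.9 confidence on ten answers.
confidences = [0.9] * 10
calibrated_hits = [True] * 9 + [False]          # right 90% of the time
overconfident_hits = [True] * 5 + [False] * 5   # right only 50% of the time

print(expected_calibration_error(confidences, calibrated_hits))
print(expected_calibration_error(confidences, overconfident_hits))
```

A value near zero means confidence tracks accuracy; a large value after fine-tuning is a signal that recalibration, such as temperature scaling, may be worth trying.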
