
Confidence Calibration: Why 'I'm Sure' Means Nothing from an LLM

Why an LLM's verbalized confidence ('definitely', 'I'm certain') is a poor guide to factual accuracy, and how to build calibrated uncertainty signals from logprobs, ensemble disagreement, and self-consistency checks.

GPT-3.5 was asked: 'Are you confident that Abraham Lincoln was born in Kentucky?' It replied: 'Yes, I am confident that Abraham Lincoln was born in Kentucky.' Lincoln was indeed born in Kentucky. GPT-3.5 was asked: 'Are you confident that the Battle of Hastings was in 1065?' It replied: 'Yes, I am confident that the Battle of Hastings took place in 1065.' It was 1066. Both answers expressed identical confidence. One was correct.

Verbalized confidence in LLMs is not calibrated. Phrases such as 'I am confident', 'definitely', 'certainly', and 'I'm sure' are poor predictors of factual accuracy. They track the model's prediction of what a confident-sounding answer looks like in context, not whether the underlying claim is true.

The calibration failure, measured

A well-calibrated model is correct 90% of the time when it says it is 90% confident, 70% of the time when it says 70%, and so on. Studies of LLM verbalized confidence find that frontier models are severely overconfident: when they express 90% confidence, they are correct only around 60-70% of the time. The gap runs in one direction: expressed confidence is systematically too high.
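One common way to put a number on this gap is expected calibration error (ECE): bucket answers by stated confidence, then average the difference between stated confidence and observed accuracy in each bucket. The sketch below is a minimal version under assumptions: the function name, the ten equal-width bins, and the input shape (one stated confidence and one correctness label per answer) are illustrative choices, not taken from any particular study.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average gap between stated confidence and observed accuracy."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    # bins[i] holds (confidence, correct) pairs with confidence in [i/n_bins, (i+1)/n_bins)
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


# Ten answers, each stated at 90% confidence, of which only six are correct.
# The single occupied bucket has a gap of about 0.3 between confidence and accuracy.
overconfident = expected_calibration_error([0.9] * 10, [True] * 6 + [False] * 4)
```

A perfectly calibrated set of answers scores 0. Collecting the stated confidences and correctness labels requires a labeled evaluation set, which this sketch takes as given.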

This is less a reasoning failure than a training artifact. Human preference labels tend to reward confident-sounding responses, so RLHF pushes models toward expressing high confidence because confidence sounds more helpful. The result is a model that sounds certain about things it doesn't know.

What actually correlates with correctness

Token logprobs

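The model's internal probability distribution over tokens is meaningfully calibrated in ways that verbalized confidence is not. When the model generates a factual claim with high token probabilities for the key entities, that claim is more likely to be correct than one generated with low token probabilities. You need logprob access to use this signal. OpenAI's API exposes it through the logprobs and top_logprobs request parameters, and open-weight models expose it through most local inference and serving stacks. Not every hosted API returns token logprobs (Anthropic's Messages API did not at the time of writing), so check your provider's documentation before designing around this signal.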
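APIs return log probabilities, not probabilities, and a key entity often spans several tokens. A minimal sketch of the conversion, assuming you have already extracted (token, logprob) pairs from a response; the function names, the token split, and the sample values below are hypothetical and only illustrate the arithmetic.

```python
import math
from typing import Sequence


def token_probability(logprob: float) -> float:
    """Convert a natural-log probability into a probability in [0, 1]."""
    return math.exp(logprob)


def span_probability(logprobs: Sequence[float]) -> float:
    """Joint probability of a multi-token span: exp of the summed token logprobs."""
    return math.exp(sum(logprobs))


# Hypothetical (token, logprob) pairs, shaped like what a logprob-enabled API might return.
scored = [
    ("The", -0.01),
    (" battle", -0.02),
    (" was", -0.01),
    (" in", -0.03),
    (" 10", -0.05),
    ("65", -1.60),
]

# The year is split across the last two tokens, so score it as one span.
year_logprobs = [lp for _, lp in scored[-2:]]
year_probability = span_probability(year_logprobs)  # exp(-1.65), roughly 0.19
```

Scoring the entity as a span matters here: the first half of the year looks confident on its own, and only the second token reveals that the model was unsure which year to write.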

Self-consistency

Sample the same question 5 times at temperature 0.7. If all 5 samples give the same answer, that answer is more likely to be correct than one that appears in only 2 of the 5. This costs five times the inference of a single call, but it is reliable: self-consistency is one of the best proxy signals for factual accuracy available without external grounding.
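The sampling loop can be sketched as follows. This is a minimal illustration, not a production implementation: sample_answer is a hypothetical callable that wraps whatever model client you use (called at temperature 0.7), and normalize is a deliberately crude stand-in for real answer matching.

```python
from collections import Counter
from typing import Callable


def normalize(answer: str) -> str:
    """Crude normalization so trivially different strings count as the same answer."""
    return answer.strip().strip(".").lower()


def self_consistency(
    sample_answer: Callable[[str], str],
    question: str,
    n_samples: int = 5,
) -> tuple[str, float]:
    """Return the most common answer and the fraction of samples that agree with it."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    answers = [normalize(sample_answer(question)) for _ in range(n_samples)]
    top_answer, count = Counter(answers).most_common(1)[0]
    return top_answer, count / n_samples
```

An agreement of 1.0 means all samples matched, while 0.4 corresponds to the 2-out-of-5 case above. Exact string matching only works for short answers such as dates or names; longer free-text answers need a meaning-level comparison.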

Semantic entropy

A more sophisticated version of self-consistency: generate multiple responses, group them by meaning, and measure the entropy of the distribution over those groups. High semantic entropy (many meaningfully different answers) indicates high uncertainty. Low semantic entropy (all answers say roughly the same thing) indicates the model has a strong view, which correlates with (but doesn't guarantee) accuracy.
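A simplified sketch of the idea: cluster the sampled answers with an equivalence check, then compute the entropy of the cluster frequencies. Two assumptions to flag. First, this is the discrete, frequency-based variant; it ignores the sequence probabilities that a fuller treatment would weight clusters by. Second, same_meaning is a hypothetical callable (in practice often an entailment model), and same_year below is only a toy stand-in for it.

```python
import math
import re
from typing import Callable, Sequence


def semantic_entropy(
    answers: Sequence[str],
    same_meaning: Callable[[str, str], bool],
) -> float:
    """Entropy (in nats) over clusters of answers that share a meaning."""
    if not answers:
        raise ValueError("need at least one answer")

    # Greedy clustering: compare each answer to the first member of every cluster.
    clusters: list[list[str]] = []
    for answer in answers:
        for cluster in clusters:
            if same_meaning(cluster[0], answer):
                cluster.append(answer)
                break
        else:
            clusters.append([answer])

    probs = [len(cluster) / len(answers) for cluster in clusters]
    return sum(p * math.log(1 / p) for p in probs)


def same_year(a: str, b: str) -> bool:
    """Toy equivalence check: two answers match if they mention the same numbers."""
    return re.findall(r"\d+", a) == re.findall(r"\d+", b)


samples = ["It was 1066.", "1066", "In the year 1066", "1065"]
entropy = semantic_entropy(samples, same_year)  # two clusters: three answers vs one
```

When every sample lands in one cluster the entropy is 0, and it grows as the answers spread across more clusters. The advantage over plain string matching is that differently worded answers with the same meaning count as agreement.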

Building calibrated uncertainty into your product

Suppress verbalized confidence language: add to your system prompt "Do not use phrases like 'I am confident', 'certainly', 'definitely', or 'I'm sure'. If you are uncertain, say so explicitly."
Require explicit uncertainty: "If you are less than 90% sure of a factual claim, qualify it with the phrase 'I believe' or 'I'm not certain'." Models follow this instruction better than you might expect.
Ground uncertain claims: any factual claim the model makes should be traceable to a passage in your document corpus. Claims that can't be grounded should be flagged or suppressed.
Use logprobs as a QA signal: run a post-processing step that flags responses where any key entity token has probability below a threshold (e.g. 0.3). These responses are disproportionately likely to contain hallucinations.
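The logprob post-processing step might look something like the sketch below. It assumes an upstream step has already marked which tokens belong to key entities (names, dates, quantities); the ScoredToken type, its is_key_entity field, and both function names are hypothetical. The 0.3 threshold comes from the example above and should be tuned on your own data.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ScoredToken:
    text: str
    logprob: float
    is_key_entity: bool  # hypothetical flag, set by an upstream entity tagger


def low_confidence_entities(
    tokens: Sequence[ScoredToken],
    threshold: float = 0.3,
) -> list[ScoredToken]:
    """Key entity tokens whose probability falls below the threshold."""
    return [
        t for t in tokens
        if t.is_key_entity and math.exp(t.logprob) < threshold
    ]


def should_flag(tokens: Sequence[ScoredToken], threshold: float = 0.3) -> bool:
    """Flag the response for review if any key entity token is low-probability."""
    return bool(low_confidence_entities(tokens, threshold))
```

Restricting the check to key entities is a deliberate choice: function words and stylistic tokens often have low probability simply because many phrasings are acceptable, which says nothing about factual accuracy.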

Never present LLM verbalized confidence to end users as a reliability signal. 'The AI is 95% confident' is meaningless and potentially misleading. If you need to communicate uncertainty to users, derive it from grounding coverage or self-consistency, not from the model's own words.
