
Model Calibration and ECE: When 90% Confidence Means 70% Accuracy

The calibration problem in deep learning: Expected Calibration Error from scratch, reliability diagrams, Platt scaling, and temperature scaling, the single-parameter fix that often beats more complex calibration methods. Plus why you should always calibrate on a separate held-out set.

The Confidence Calibration Problem

A model that says it's 90% confident should be right about 90% of the time. If it's right 70% of the time when it says 90%, it's overconfident. If it's right 95% of the time when it says 90%, it's underconfident. Models that aren't calibrated are dangerous in high-stakes settings: a medical diagnosis model that outputs 0.92 when the true risk is 0.65 will cause clinicians to overtrust it.

Modern deep learning models are systematically overconfident. Guo et al. (2017) showed that modern image classifiers are markedly less calibrated than older, shallower networks, and linked the miscalibration to greater depth and width, batch normalization, and training with less weight decay. Similar overconfidence has been reported for language models.

Expected Calibration Error (ECE)
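
As a reference point for the code in this section, the quantity it computes can be written as below. The notation is an assumption introduced here, not taken from the article: N predictions are split into M equal-width confidence bins B_m, acc(B_m) is the fraction of correct predictions in a bin, and conf(B_m) is the mean confidence in that bin.

```latex
\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N} \, \bigl| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \bigr|
```

Empty bins contribute nothing to the sum, which is why the implementation skips them.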

import numpy as np

def expected_calibration_error(
    confidences: np.ndarray,
    correct: np.ndarray,
    n_bins: int = 10
) -> dict:
    """
    ECE: weighted average of |accuracy - confidence| per bin.
    confidences: predicted probability for the predicted class  (N,)
    correct:     1 if prediction was correct else 0              (N,)
    """
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    bin_stats = []
    
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
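        # Bins are (lo, hi]; the first bin also includes a confidence of exactly 0.0
        lower_ok = confidences >= lo if lo == 0.0 else confidences > lo
        mask = lower_ok & (confidences <= hi)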
        if mask.sum() == 0:
            continue
        bin_conf = confidences[mask].mean()
        bin_acc  = correct[mask].mean()
        bin_frac = mask.sum() / len(confidences)
        ece += bin_frac * abs(bin_acc - bin_conf)
        bin_stats.append({
            "bin": f"({lo:.1f},{hi:.1f}]",
            "conf": round(float(bin_conf), 3),
            "acc": round(float(bin_acc), 3),
            "n": int(mask.sum()),
        })
    
    return {"ece": round(float(ece), 4), "bins": bin_stats}

# Simulate overconfident model
np.random.seed(42)
confidences = np.random.uniform(0.6, 0.99, 1000)   # model outputs always high confidence
correct = (np.random.rand(1000) < 0.75).astype(int)  # but only right 75% of the time
result = expected_calibration_error(confidences, correct)
print(f"ECE = {result['ece']}")   # clearly above 0: confidence is unrelated to correctness here
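
The function above expects one confidence and one correctness flag per example. A minimal sketch of how those two arrays could be derived from a matrix of class probabilities follows; the `probs` and `labels` values are made up for illustration and are not from the article.

```python
import numpy as np

# Hypothetical softmax outputs for 4 examples over 3 classes (illustrative values)
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
])
labels = np.array([0, 1, 0, 1])  # hypothetical ground-truth class indices

predictions = probs.argmax(axis=1)             # predicted class per example
confidences = probs.max(axis=1)                # probability assigned to that class
correct = (predictions == labels).astype(int)  # 1 where the prediction matches the label
```

These `confidences` and `correct` arrays have the (N,) shapes that `expected_calibration_error` takes as input.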

Visualizing Calibration: Reliability Diagrams

import matplotlib.pyplot as plt

def reliability_diagram(confidences: np.ndarray, correct: np.ndarray, n_bins: int = 10):
    bin_edges = np.linspace(0, 1, n_bins + 1)
    bin_width = 1.0 / n_bins  # keeps bars aligned with the bins when n_bins != 10
    bin_centers, bin_accs, bin_sizes = [], [], []
    
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        # Bins are (lo, hi]; the first bin also includes a confidence of exactly 0.0
        lower_ok = confidences >= lo if lo == 0.0 else confidences > lo
        mask = lower_ok & (confidences <= hi)
        if mask.sum() == 0:
            continue
        bin_centers.append((lo + hi) / 2)
        bin_accs.append(correct[mask].mean())
        bin_sizes.append(mask.sum())
    
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
    
    # Left: accuracy per bin; bars below the diagonal mean overconfidence
    ax1.plot([0, 1], [0, 1], "k--", label="Perfect calibration")
    ax1.bar(bin_centers, bin_accs, width=bin_width, alpha=0.7, label="Model accuracy per bin")
    ax1.set(xlabel="Confidence", ylabel="Accuracy", title="Reliability Diagram")
    ax1.legend()
    
    # Right: how many predictions fall into each confidence bin
    ax2.bar(bin_centers, bin_sizes, width=bin_width, alpha=0.7, color="orange")
    ax2.set(xlabel="Confidence", ylabel="Count", title="Confidence Distribution")
    
    plt.tight_layout()
    return fig

Calibration Methods

Platt scaling: fit a logistic regression on top of the model's raw scores using a held-out calibration set. It learns only a slope and an offset, so it is simple and hard to overfit, and it works well when the miscalibration is monotonic and roughly sigmoid-shaped (essentially a scaling issue).
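
A minimal sketch of Platt scaling for a binary classifier, assuming the calibrated probability takes the form sigmoid(a * score + b). The helper names `fit_platt` and `apply_platt` and the synthetic calibration data are hypothetical, chosen only to make the example self-contained.

```python
import numpy as np
from scipy.optimize import minimize

def fit_platt(scores: np.ndarray, labels: np.ndarray) -> tuple[float, float]:
    """Fit p = sigmoid(a * score + b) by minimising binary NLL."""
    def nll(params: np.ndarray) -> float:
        a, b = params
        z = a * scores + b
        # Binary cross-entropy written as log(1 + e^z) - y*z for numerical stability
        return float(np.mean(np.logaddexp(0.0, z) - labels * z))

    result = minimize(nll, x0=np.array([1.0, 0.0]), method="BFGS")
    a, b = result.x
    return float(a), float(b)

def apply_platt(scores: np.ndarray, a: float, b: float) -> np.ndarray:
    """Map raw scores to calibrated probabilities."""
    return 1.0 / (1.0 + np.exp(-(a * scores + b)))

# Synthetic calibration set (illustrative): raw scores are 3x too extreme
rng = np.random.default_rng(0)
true_logits = rng.normal(0.0, 1.0, 5000)
labels = (rng.random(5000) < 1.0 / (1.0 + np.exp(-true_logits))).astype(float)
raw_scores = 3.0 * true_logits

a, b = fit_platt(raw_scores, labels)
calibrated = apply_platt(raw_scores, a, b)
print(f"a = {a:.3f}, b = {b:.3f}")
```

Because the synthetic scores were inflated by a factor of 3, the fitted slope should land near 1/3, shrinking the scores back toward the scale that matches the observed outcomes.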

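Temperature scaling: divide the logits by a learned scalar T before applying softmax. T > 1 makes the distribution softer (reduces overconfidence). T < 1 sharpens it. Because dividing by a positive T does not change the ranking of the logits, the predicted class, and therefore accuracy, stays the same. Guo et al. showed this single parameter often matches or beats more complex methods.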

import numpy as np
import torch
import torch.nn.functional as F
from scipy.optimize import minimize_scalar

def temperature_scale(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    """Divide logits by temperature before softmax."""
    return F.softmax(logits / temperature, dim=-1)

def find_temperature(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Find T that minimises NLL on calibration set."""
    # Detach and move to CPU so this also works for GPU tensors that track gradients
    logits = logits.detach().cpu().float()
    labels = labels.detach().cpu().long()
    
    def nll(T: float) -> float:
        # cross_entropy applies log-softmax internally, which is numerically stable
        return F.cross_entropy(logits / T, labels).item()
    
    result = minimize_scalar(nll, bounds=(0.1, 10.0), method="bounded")
    return float(result.x)

# Fit T on held-out logits, then apply it to new predictions
T = find_temperature(val_logits, val_labels)
calibrated_probs = temperature_scale(test_logits, T)
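
To see what the temperature does to a single prediction, here is a small NumPy sketch. The logits are made-up values and the local `softmax` helper is defined only for this illustration.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Softmax over a 1-D array, shifted by the max for numerical stability."""
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([4.0, 1.0, 0.5])  # hypothetical logits for one example

for T in (0.5, 1.0, 2.0):
    p = softmax(logits / T)
    print(f"T={T}: top probability={p.max():.3f}, predicted class={p.argmax()}")
```

The top probability shrinks as T grows while the predicted class stays the same, which is why temperature scaling changes confidence without changing accuracy.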

Isotonic regression: a non-parametric monotonic fit on confidence vs. accuracy. More flexible than Platt scaling, but prone to overfitting when the calibration set is small.
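
A minimal sketch of isotonic calibration, assuming scikit-learn is available and using its `IsotonicRegression` estimator. The data is synthetic: the simulated model's true accuracy is the square of its stated confidence, so it is overconfident everywhere.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Synthetic calibration set (illustrative): accuracy is confidence squared
rng = np.random.default_rng(0)
confidences = rng.uniform(0.5, 1.0, 2000)
correct = (rng.random(2000) < confidences ** 2).astype(float)

# Fit a non-decreasing step function from confidence to observed correctness
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(confidences, correct)

calibrated = iso.predict(confidences)
print(f"mean raw confidence:        {confidences.mean():.3f}")
print(f"mean calibrated confidence: {calibrated.mean():.3f}")
print(f"observed accuracy:          {correct.mean():.3f}")
```

In practice the fitted `iso` object would be applied to confidences from a different split than the one it was fitted on; it is evaluated on its own calibration data here only to keep the sketch short.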

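Always calibrate on a held-out set that was NOT used in training. The model has already memorized its training examples, so its confidence looks far better justified there than it is on unseen data; a calibrator fitted on the training set learns too weak a correction and makes ECE look better than it really is. Use a dedicated calibration split, separate from both training and test, and report ECE on the test split.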
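
One way to carve out the three splits is sketched below. The dataset size and the 70/15/15 proportions are assumptions for illustration, not a recommendation from the article.

```python
import numpy as np

rng = np.random.default_rng(0)
n_examples = 1000  # hypothetical dataset size
indices = rng.permutation(n_examples)

# Illustrative 70/15/15 split, computed with integer arithmetic
n_train = n_examples * 70 // 100
n_calib = n_examples * 15 // 100

train_idx = indices[:n_train]                   # fit the model here
calib_idx = indices[n_train:n_train + n_calib]  # fit T / Platt / isotonic here
test_idx = indices[n_train + n_calib:]          # report accuracy and ECE here
```

Keeping the calibration indices disjoint from the test indices means the reported ECE reflects data that neither the model nor the calibrator has seen.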
