
Conformal Prediction: Rigorous Uncertainty Quantification Without Bayesian Assumptions

A distribution-free framework with a coverage guarantee: the prediction set contains the true label with probability at least 1-α, for any model and any data distribution, as long as calibration and test data are exchangeable. Split conformal from scratch, Mondrian CP for group-conditional coverage, and how to use prediction set size as a real-time uncertainty signal in production.

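Bayesian uncertainty requires a prior and a likelihood model. Calibration metrics such as expected calibration error (ECE) measure how well confidence matches accuracy on average, but say nothing about an individual prediction. Conformal prediction is neither: it is a distribution-free framework with a coverage guarantee. The prediction set contains the true label with probability at least 1-α (at least 90% of the time for α = 0.1), regardless of the model or the data distribution. The only assumption is that calibration and test data are exchangeable.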

The Core Idea

Instead of outputting a single prediction, output a set of labels that contains the true label with probability at least 1-α. The set is built using a held-out calibration dataset that was not used to train the model.

# Split Conformal Prediction (the standard approach)
# Step 1: fit model on training data
# Step 2: compute nonconformity scores on held-out calibration set
# nonconformity score = 1 - softmax_prob[true_label]  (for classification)

import numpy as np

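def split_conformal(cal_scores, alpha=0.1):
    n = len(cal_scores)
    # Finite-sample corrected level: ceil((n + 1)(1 - alpha)) / n
    q_level = np.ceil((1 - alpha) * (n + 1)) / n
    if q_level > 1:
        # Too few calibration points for this alpha: np.quantile would
        # raise, so return a threshold that admits every class
        return np.inf
    q_hat = np.quantile(cal_scores, q_level)
    return q_hat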

# At test time: include all classes where 1 - softmax_prob[class] <= q_hat
def predict_set(softmax_probs, q_hat):
    scores = 1 - softmax_probs
    return np.where(scores <= q_hat)[0]  # indices of included classes

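The coverage guarantee: if calibration data and test data are exchangeable (i.i.d. sampling from the same distribution is a sufficient condition), the prediction set contains the true label with probability >= 1-α. This is a finite-sample guarantee, not an asymptotic one. It is also marginal: the probability is taken over the random draw of the calibration set and the test point, not for each input individually. It holds for any model (however it was trained), any nonconformity score, and any data distribution.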
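The guarantee can be checked empirically. The sketch below is an illustration under stated assumptions, not a benchmark: the "model" is a set of random softmax outputs, the labels are drawn from a deliberately flattened version of those probabilities so the model is overconfident, and the function name simulate_coverage and all of its parameters are hypothetical. Even with a miscalibrated model, the measured coverage on fresh exchangeable data should land close to 1-α.

```python
import numpy as np

def split_conformal(cal_scores, alpha=0.1):
    n = len(cal_scores)
    q_level = np.ceil((1 - alpha) * (n + 1)) / n
    if q_level > 1:
        return np.inf  # too few calibration points: admit every class
    return np.quantile(cal_scores, q_level)

def simulate_coverage(n_cal=1000, n_test=5000, n_classes=5, alpha=0.1, seed=0):
    """Synthetic check: returns (empirical coverage, average set size)."""
    rng = np.random.default_rng(seed)
    n = n_cal + n_test

    # Stand-in for a model: softmax over random logits (an assumption).
    logits = rng.normal(scale=2.0, size=(n, n_classes))
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)

    # Labels come from a flatter distribution than the model reports,
    # so the model is overconfident by construction.
    true_dist = np.sqrt(probs)
    true_dist /= true_dist.sum(axis=1, keepdims=True)
    labels = np.array([rng.choice(n_classes, p=p) for p in true_dist])

    # Nonconformity score of the true label: 1 - softmax_prob[true_label]
    scores = 1.0 - probs[np.arange(n), labels]

    # Calibrate on the first n_cal points, evaluate on the rest.
    q_hat = split_conformal(scores[:n_cal], alpha)
    coverage = np.mean(scores[n_cal:] <= q_hat)
    avg_size = np.mean(np.sum(1.0 - probs[n_cal:] <= q_hat, axis=1))
    return coverage, avg_size

coverage, avg_size = simulate_coverage()
print(f"coverage={coverage:.3f}, average set size={avg_size:.2f}")
```

Because the guarantee is marginal, a single run will fluctuate around the target. Rerunning with different seeds gives a sense of the spread, which narrows as the calibration set grows.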

Why This Beats Softmax for Uncertainty

Softmax outputs: 0.97 confidence does not mean 97% accuracy. Modern neural nets are often systematically overconfident.

Calibration: temperature scaling improves average calibration but gives no per-prediction guarantee.

Conformal prediction: the prediction set has guaranteed marginal coverage. If the set is small (one class), the model is confident. If the set is large (many classes), it is uncertain. The size of the set is the uncertainty signal.

Production implication: when the prediction set has more than k classes, route to a human reviewer. This is a principled uncertainty-based deferral rule.
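A minimal sketch of that deferral rule, assuming q_hat has already been computed on a calibration set. The function name route, the max_set_size parameter (the k above), and the string labels are hypothetical. The sketch also defers on an empty set, which can occur with the 1 - softmax score when no class clears the threshold; that choice is an assumption, not something the rule above prescribes.

```python
import numpy as np

def route(softmax_probs, q_hat, max_set_size=1):
    """Return ("auto" | "human_review", prediction_set) for one input."""
    # Same set construction as predict_set: keep classes with score <= q_hat
    pred_set = np.where(1.0 - np.asarray(softmax_probs) <= q_hat)[0]

    # Defer when the set is larger than k, or empty (assumed policy).
    if len(pred_set) == 0 or len(pred_set) > max_set_size:
        return "human_review", pred_set
    return "auto", pred_set

decision, pred_set = route([0.50, 0.45, 0.05], q_hat=0.6)
print(decision, pred_set)
```

Raising max_set_size trades reviewer workload against how much ambiguity is allowed through automatically.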

Mondrian Conformal Prediction: Conditional Coverage

Standard conformal gives marginal coverage: averaged over all inputs, coverage is at least 1-α. But you might want coverage to hold separately for each class, demographic group, or input type. Mondrian conformal splits the calibration set by group and computes a separate threshold per group, so the guarantee holds within each group. The cost is that every group needs enough calibration examples of its own.

This matters in production when coverage needs to be uniform across groups — medical diagnosis must have 95% coverage for all patient demographics, not just on average. Standard conformal might give 99% coverage for the majority group and 88% for the minority group while averaging 95%.
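The per-group thresholds can be sketched as follows. This assumes the group of each input (for example a demographic attribute) is known at test time; the function names mondrian_thresholds and predict_set_grouped are hypothetical. Label-conditional Mondrian CP, where the groups are the classes themselves, would instead compare each candidate class against that class's own threshold and is not shown here.

```python
import numpy as np

def conformal_quantile(scores, alpha):
    n = len(scores)
    q_level = np.ceil((1 - alpha) * (n + 1)) / n
    if q_level > 1:
        return np.inf  # group too small for this alpha: admit every class
    return np.quantile(scores, q_level)

def mondrian_thresholds(cal_scores, cal_groups, alpha=0.1):
    """One threshold per group, each computed only from that group's scores."""
    cal_scores = np.asarray(cal_scores)
    cal_groups = np.asarray(cal_groups)
    return {
        g.item(): conformal_quantile(cal_scores[cal_groups == g], alpha)
        for g in np.unique(cal_groups)
    }

def predict_set_grouped(softmax_probs, group, thresholds):
    # A group never seen in calibration gets the full label set (assumed policy).
    q_hat = thresholds.get(group, np.inf)
    return np.where(1.0 - np.asarray(softmax_probs) <= q_hat)[0]
```

A group with few calibration points gets a noisy or infinite threshold, so in practice small groups tend to produce larger sets.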

When to Use It in Production

High-stakes classification with human-in-the-loop: the model outputs a prediction set and a human reviews when set size > 1. This gives a coverage guarantee without requiring Bayesian machinery.

Distribution shift monitoring: track average prediction set size over time. When inputs drift toward regions where the model is less confident, set size tends to grow, which can make it a leading indicator of accuracy degradation that needs no labels. Shift also breaks exchangeability, so the coverage guarantee itself no longer holds until you recalibrate on fresh data.

Applied Scientist interviews: 'How would you quantify prediction uncertainty in production?' Conformal prediction is the rigorous answer. Softmax confidence is the naive answer. ECE + temperature scaling is the intermediate answer.
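Set-size monitoring can be sketched as a rolling average compared against a baseline measured at calibration time. The class name SetSizeMonitor, the window length, and the 1.5x alert ratio are all hypothetical choices for illustration; a real deployment would tune them or use a proper drift test.

```python
from collections import deque

import numpy as np

class SetSizeMonitor:
    """Flags when the rolling mean prediction-set size exceeds a baseline."""

    def __init__(self, baseline_mean_size, window=500, ratio_threshold=1.5):
        self.baseline = baseline_mean_size      # mean set size at calibration time
        self.ratio_threshold = ratio_threshold  # alert when mean > ratio * baseline
        self.sizes = deque(maxlen=window)       # most recent set sizes only

    def update(self, pred_set):
        """Record one prediction set; return True if the alert fires."""
        self.sizes.append(len(pred_set))
        if len(self.sizes) < self.sizes.maxlen:
            return False  # wait until the window is full
        return bool(np.mean(self.sizes) > self.ratio_threshold * self.baseline)
```

An alert here is a prompt to collect labels and recalibrate, not proof that accuracy has dropped.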
