
Bayesian Reasoning for ML: MAP vs MLE, Priors as Regularization, Uncertainty Quantification

L1 and L2 regularization are MAP estimation under Laplace and Gaussian priors. Bayesian A/B testing gives P(variant > control) directly. Gaussian Processes quantify uncertainty between training points. The math Applied Scientist interviews actually probe.

The Prior Is Not Cheating

Bayesian reasoning gets dismissed as 'just put a prior on it.' That framing misses the point. The prior is your model of the world before seeing data. Ignoring it doesn't make it go away — it just means you're secretly using a flat prior, which is its own strong assumption.

MLE vs MAP: What's Actually Different

Maximum Likelihood Estimation (MLE): find the parameters θ that maximize P(data | θ). Maximum A Posteriori (MAP): find θ that maximizes P(θ | data) = P(data | θ) × P(θ) / P(data). MAP is MLE with a prior term. That's it.

# MLE: maximize log P(data | θ)
# Equivalent to minimizing negative log-likelihood

# MAP: maximize log P(θ | data)
#    = log P(data | θ) + log P(θ) + const
#    = MLE objective + log prior

# With Gaussian prior θ ~ N(0, σ²):
# log P(θ) = -θ²/(2σ²) + const
# → MAP objective = log-likelihood - λ||θ||², with λ = 1/(2σ²)  (L2 regularization)

# With Laplace prior θ ~ Laplace(0, b):
# log P(θ) = -|θ|/b + const
# → MAP objective = log-likelihood - λ||θ||₁, with λ = 1/b  (L1 regularization)

L1 and L2 regularization are not tricks. They are MAP estimation under Laplace and Gaussian priors respectively. When you tune a regularization strength λ, you are choosing the width of your prior: a larger λ corresponds to a narrower prior that pulls parameters harder toward zero.
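
The equivalence can be checked numerically. The sketch below is a minimal illustration rather than a reference implementation: the synthetic data, `noise_sd`, and `prior_sd` are all assumed values. It minimizes the negative log-posterior of a linear model with a Gaussian prior and compares the result to the closed-form ridge solution with λ = noise_sd² / prior_sd².

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Synthetic linear-regression data (illustrative values only)
n, d = 50, 3
X = rng.normal(size=(n, d))
true_theta = np.array([1.5, -2.0, 0.5])
noise_sd = 0.5   # assumed known observation noise
prior_sd = 1.0   # assumed Gaussian prior N(0, prior_sd^2) on each weight
y = X @ true_theta + rng.normal(scale=noise_sd, size=n)

def neg_log_posterior(theta):
    # -log P(data | theta) - log P(theta), dropping theta-independent constants
    nll = np.sum((y - X @ theta) ** 2) / (2 * noise_sd ** 2)
    neg_log_prior = np.sum(theta ** 2) / (2 * prior_sd ** 2)
    return nll + neg_log_prior

# MAP estimate by direct numerical optimization
theta_map = minimize(neg_log_posterior, x0=np.zeros(d)).x

# Ridge regression closed form with lambda = noise_sd^2 / prior_sd^2
lam = noise_sd ** 2 / prior_sd ** 2
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print("MAP:  ", theta_map)
print("Ridge:", theta_ridge)
```

The two estimates should agree up to optimizer tolerance. In this squared-error form, λ is the ratio of noise variance to prior variance.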

Bayes' Theorem: The Mechanics

P(θ | data) = P(data | θ) × P(θ) / P(data). Posterior = Likelihood × Prior / Evidence. The evidence P(data) is just a normalizing constant — it doesn't depend on θ, so for optimization you can ignore it. You're maximizing Likelihood × Prior.
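
A small grid approximation makes this concrete. The coin-flip data and the Beta(2, 2) prior below are illustrative assumptions. The sketch computes the unnormalized product Likelihood × Prior, then divides by a numerical estimate of the evidence, and shows that the maximizer is the same either way.

```python
import numpy as np
from scipy import stats

# Grid over a coin's heads probability theta
theta = np.linspace(0.001, 0.999, 999)

# Illustrative data: 7 heads in 10 flips; Beta(2, 2) prior (both are assumptions)
heads, flips = 7, 10
likelihood = stats.binom.pmf(heads, flips, theta)
prior = stats.beta.pdf(theta, 2, 2)

unnormalized = likelihood * prior
# Riemann-sum approximation of the evidence P(data)
evidence = np.sum(unnormalized) * (theta[1] - theta[0])
posterior = unnormalized / evidence

# Dividing by a positive constant does not move the argmax
map_unnormalized = theta[np.argmax(unnormalized)]
map_normalized = theta[np.argmax(posterior)]
print(map_unnormalized, map_normalized)
```

The evidence only matters when you need actual posterior probabilities (for example, a credible interval), not when you only need the location of the maximum.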

Why Uncertainty Quantification Matters

MLE gives you a point estimate — one set of parameters. Bayesian inference gives you a posterior distribution over parameters. The difference matters when you're making decisions under uncertainty.

Point estimate: the model says 87% probability. How uncertain is that 87%? The point estimate alone doesn't say. Posterior: the model says 87% ± 12%, so you know you should hedge.

Applied Scientist interviewers will ask 'how confident are you in this model's output?' A bare MLE point estimate can't answer that; you need something on top of it, such as standard errors, a bootstrap, or a posterior. Calibration (ECE, expected calibration error) is a related frequentist check: it measures whether predicted probabilities match observed frequencies.
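
As a rough sketch of the calibration check, the function below computes ECE with equal-width confidence bins. The helper name `expected_calibration_error`, the bin count, and the two simulated models are assumptions made for illustration, not a standard library API.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges give bin indices from 0 to n_bins - 1
    bin_ids = np.digitize(confidences, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece

rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=20_000)

# Calibrated model: it is right with probability equal to its confidence
calibrated_hits = rng.random(conf.size) < conf
# Overconfident model: actual accuracy is 0.15 lower than stated
overconfident_hits = rng.random(conf.size) < (conf - 0.15)

ece_calibrated = expected_calibration_error(conf, calibrated_hits)
ece_overconfident = expected_calibration_error(conf, overconfident_hits)
print(f"calibrated ECE    = {ece_calibrated:.3f}")
print(f"overconfident ECE = {ece_overconfident:.3f}")
```

The calibrated model should score near zero, while the overconfident one should land near its built-in 0.15 gap.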

Conjugate Priors: The Closed-Form Cases

A prior is conjugate to a likelihood when the posterior belongs to the same distribution family as the prior. For example, a Beta prior combined with a Binomial likelihood yields a Beta posterior. The posterior then has a closed form, so no MCMC is needed.
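
For the Beta-Binomial pair, the closed form follows from multiplying the two densities and collecting exponents. The symbols here are introduced for illustration: k successes in n trials, and a Beta(α, β) prior on the success probability θ.

```latex
\begin{aligned}
P(\theta \mid k, n)
  &\propto \underbrace{\theta^{k}(1-\theta)^{n-k}}_{\text{Binomial likelihood}}
     \times \underbrace{\theta^{\alpha-1}(1-\theta)^{\beta-1}}_{\text{Beta}(\alpha,\beta)\text{ prior}} \\
  &= \theta^{\alpha+k-1}(1-\theta)^{\beta+n-k-1} \\
\Rightarrow\quad \theta \mid k, n &\sim \text{Beta}(\alpha + k,\ \beta + n - k)
\end{aligned}
```

The update rule is just counting: add successes to α and failures to β.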

Beta-Binomial: The A/B Test Example

You're running an A/B test. Control: 40 conversions out of 200. Variant: 48 out of 200. Is the variant better?

from scipy import stats
import numpy as np

# Prior: Beta(1, 1) = uniform over [0, 1] (a flat prior, not the absence of one)
# Conjugate update: Beta(1 + successes, 1 + failures)
control_posterior = stats.beta(1 + 40, 1 + 160)
variant_posterior = stats.beta(1 + 48, 1 + 152)

# Probability variant > control via Monte Carlo
rng = np.random.default_rng(0)  # fixed seed so the estimate is reproducible
n_samples = 100_000
control_samples = control_posterior.rvs(n_samples, random_state=rng)
variant_samples = variant_posterior.rvs(n_samples, random_state=rng)
print(f'P(variant > control) = {(variant_samples > control_samples).mean():.3f}')
# → roughly 0.83: about an 83% probability that the variant is better

# Credible interval (not confidence interval)
print(f'Variant 95% credible interval: {variant_posterior.ppf([0.025, 0.975])}')

Bayesian A/B testing gives you P(variant > control) directly, which is what you actually want to know. A frequentist p-value gives you the probability of data at least as extreme as what you observed, assuming the null hypothesis is true. That is not the question you're asking, and it requires careful interpretation.
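
The Monte Carlo estimate can be cross-checked without sampling. This sketch uses the identity P(variant > control) = ∫ f_variant(x) · F_control(x) dx and evaluates it with `scipy.integrate.quad`, on the same Beta posteriors as the example above. Treat it as a sanity check rather than the canonical way to compute the quantity.

```python
from scipy import integrate, stats

# Same posteriors as the A/B example: Beta(1, 1) prior plus observed counts
control_posterior = stats.beta(1 + 40, 1 + 160)
variant_posterior = stats.beta(1 + 48, 1 + 152)

def integrand(x):
    # Density of the variant rate at x, times P(control rate < x)
    return variant_posterior.pdf(x) * control_posterior.cdf(x)

# Integrate over the full support of a conversion rate
p_variant_better, abs_error = integrate.quad(integrand, 0.0, 1.0)
print(f"P(variant > control) = {p_variant_better:.4f} (quad error ~ {abs_error:.1e})")
```

Because the result is deterministic, it is a convenient reference for judging how much Monte Carlo noise a given sample count introduces.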

Gaussian Processes: Bayesian Non-Parametric Regression

A Gaussian Process defines a prior over functions. Instead of 'I think the true function has these parameters,' you say 'I think the true function is smooth in this way.' After seeing data, the posterior GP is the distribution over functions consistent with what you've observed — including uncertainty that grows in regions with no data.

GPs are the rigorous answer to 'how confident should my model be between training points?' The kernel function encodes your prior about function smoothness. RBF kernel = nearby points should have similar outputs. Periodic kernel = function repeats.
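
To see the uncertainty behavior directly, here is a minimal NumPy sketch of GP regression with an RBF kernel and a zero-mean prior. The training points, length scale, and noise level are illustrative assumptions; a production setup would more likely use a dedicated GP library and fit the kernel hyperparameters.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """RBF kernel: exp(-(a - b)^2 / (2 * length_scale^2)) for 1-D inputs."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return np.exp(-sq_dist / (2 * length_scale ** 2))

# Illustrative training data: three nearby points on a sine curve
x_train = np.array([-1.0, 0.0, 1.0])
y_train = np.sin(x_train)
noise_var = 1e-4  # small assumed observation noise, also stabilizes the solve

# One test point between training points, one far from all of them
x_test = np.array([0.5, 5.0])

K = rbf_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
K_s = rbf_kernel(x_train, x_test)
K_ss = rbf_kernel(x_test, x_test)

# Standard GP regression posterior under a zero-mean prior
post_mean = K_s.T @ np.linalg.solve(K, y_train)
post_cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
post_std = np.sqrt(np.clip(np.diag(post_cov), 0.0, None))

for x, m, s in zip(x_test, post_mean, post_std):
    print(f"x = {x:4.1f}: mean = {m:6.3f}, std = {s:.3f}")
```

The standard deviation at x = 0.5 should be small, while at x = 5.0 it should return almost to the prior's value of 1, because the RBF kernel assigns nearly zero correlation to points that far from the data.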

When Interviewers Test This

Applied Scientist roles at Cohere, Amazon Science, Google Research, and AI-native startups probe Bayesian reasoning in three ways: (1) 'Why does L2 regularization help?' — they want the MAP derivation, not 'it prevents overfitting.' (2) 'How do you quantify uncertainty in your model's predictions?' — they want posterior distributions, calibration, or conformal prediction, not just softmax scores. (3) 'How would you design this A/B test?' — they want prior specification, sample size from power analysis, and Bayesian stopping rules.

Common trap: saying 'I'd use a 95% confidence interval' when a 95% credible interval is what you actually mean. These are different things. Frequentist confidence interval: if we repeated the experiment many times, 95% of the constructed intervals would contain the true parameter. Bayesian credible interval: given this data, there's a 95% probability the parameter is in this range. The Bayesian answer is what people intuitively want.

Another trap: treating softmax probabilities as calibrated uncertainty. They're not. A softmax score of 0.95 does not mean a 95% probability of being correct, especially on out-of-distribution inputs.
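
The frequentist definition is a statement about the procedure over repeated experiments, which a short simulation can show. The sketch below assumes a true conversion rate of 0.2 and uses the simple normal-approximation (Wald) interval; both choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

true_p = 0.2           # assumed true conversion rate (illustrative)
n_users = 200          # users per simulated experiment
n_experiments = 10_000

# Simulate many repeated experiments and build a 95% Wald interval for each
conversions = rng.binomial(n_users, true_p, size=n_experiments)
p_hat = conversions / n_users
std_err = np.sqrt(p_hat * (1 - p_hat) / n_users)
lower = p_hat - 1.96 * std_err
upper = p_hat + 1.96 * std_err

# Coverage: fraction of intervals that contain the fixed true parameter
coverage = np.mean((lower <= true_p) & (true_p <= upper))
print(f"Fraction of intervals containing true_p: {coverage:.3f}")
```

The coverage should come out close to 0.95. Note what varies here: the parameter is fixed and the intervals are random, which is the reverse of the credible-interval reading.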

Variational Inference: When the Posterior Is Intractable

For complex models (deep networks, large latent variable models), the true posterior P(θ | data) is intractable: the evidence integral in the denominator, P(data) = ∫ P(data | θ) P(θ) dθ, has no closed form. Variational Inference (VI) approximates the posterior by finding the distribution Q(θ) from a tractable family (e.g., Gaussian) that is closest to it, measured by the KL divergence KL(Q(θ) || P(θ | data)). This turns Bayesian inference into an optimization problem.

VAEs use VI. The encoder outputs μ and σ, the parameters of the approximate posterior Q(z | x). The ELBO (Evidence Lower BOund) is the objective you maximize: ELBO = E[log P(x | z)] - KL(Q(z | x) || P(z)), where the expectation is over z drawn from Q(z | x). The first term is reconstruction quality. The second term is the KL penalty that keeps the approximate posterior Q(z | x) close to the prior P(z).
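
For the common case of a diagonal-Gaussian Q(z | x) and a standard normal prior P(z), the KL term has a closed form. The sketch below computes it alongside the reparameterized sample used to estimate the reconstruction term. The function names and the encoder outputs `mu` and `log_var` are hypothetical, and no actual encoder or decoder network is included.

```python
import numpy as np

def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)

# Hypothetical encoder outputs for one input x with a 2-D latent space
mu = np.array([0.5, -1.0])
log_var = np.array([0.0, -0.5])

kl = kl_diag_gaussian_to_standard_normal(mu, log_var)
z = reparameterize(mu, log_var, rng)
print(f"KL penalty = {kl:.4f}, sampled z = {z}")

# When Q(z | x) equals the prior N(0, I), the penalty vanishes
kl_at_prior = kl_diag_gaussian_to_standard_normal(np.zeros(2), np.zeros(2))
print(f"KL at prior = {kl_at_prior:.4f}")
```

Writing the sample as μ + σ·ε keeps the randomness in ε, so gradients can flow through μ and σ when the same computation is written in an autodiff framework.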
