Probabilistic Graphical Models: Bayesian Networks, MRFs, Latent Variable Models, and EM

Bayesian networks factorize the joint distribution along a DAG. This article covers d-separation and collider bias (conditioning on a common effect creates spurious correlation), MRFs and CRFs for sequence labeling, and latent variable models (GMM, LDA, VAE), including when to use EM vs. variational inference.

Probabilistic graphical models (PGMs) are the language of structured uncertainty. When your data has known conditional independence structure — a document topic influences its words, a user's intent influences their query — PGMs let you encode that structure into the model rather than forcing a neural net to learn it from scratch.

Bayesian Networks: Directed Graphical Models

A Bayesian network encodes the joint distribution P(X1, ..., Xn) as a product of conditional distributions, one per node: P(X1, ..., Xn) = ∏_i P(Xi | Parents(Xi)). Equivalently, each node is conditionally independent of its non-descendants given its parents.

# Joint distribution factorizes along the graph
# For a simple chain A → B → C:
# P(A, B, C) = P(A) × P(B|A) × P(C|B)

# D-separation: X and Y are conditionally independent given Z
# if Z d-separates X from Y in the graph
# Three patterns:
# Chain: X → Z → Y   |  X ⊥ Y | Z  (Z blocks the path)
# Fork:  X ← Z → Y   |  X ⊥ Y | Z  (Z blocks the path)
# Collider: X → Z ← Y | X and Y are independent, but X ⊥̸ Y | Z (conditioning OPENS path)
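
The chain rule above can be checked numerically on a tiny example. The sketch below builds the joint for a binary chain A → B → C from conditional probability tables and confirms that A tells you nothing more about C once B is known. The table values and the helper names `joint` and `prob` are made up for illustration.

```python
import itertools

# Hypothetical conditional probability tables for a binary chain A -> B -> C.
p_a = {0: 0.7, 1: 0.3}
p_b_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}  # p_b_given_a[a][b]
p_c_given_b = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.1, 1: 0.9}}  # p_c_given_b[b][c]


def joint(a, b, c):
    """P(A, B, C) = P(A) * P(B|A) * P(C|B), the factorization along the chain."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]


def prob(**fixed):
    """Sum the joint over all assignments consistent with the fixed values."""
    total = 0.0
    for a, b, c in itertools.product((0, 1), repeat=3):
        assignment = {"a": a, "b": b, "c": c}
        if all(assignment[name] == value for name, value in fixed.items()):
            total += joint(a, b, c)
    return total


total_mass = prob()  # a valid joint sums to 1

# Conditioning on B blocks the path: P(C=1 | B=1, A=a) is the same for both a.
p_c1_given_b1_a0 = prob(a=0, b=1, c=1) / prob(a=0, b=1)
p_c1_given_b1_a1 = prob(a=1, b=1, c=1) / prob(a=1, b=1)

# Without conditioning on B, A and C are dependent.
p_c1_given_a0 = prob(a=0, c=1) / prob(a=0)
p_c1_given_a1 = prob(a=1, c=1) / prob(a=1)
```

Brute-force enumeration like this only scales to a handful of variables, but it is a convenient way to sanity-check a d-separation claim before trusting it.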

The collider pattern is counterintuitive: X and Y are marginally independent, but conditioning on their common effect Z (or on any descendant of Z) creates dependence. Example: Disease and Injury are independent causes of Hospitalization. Among hospitalized patients, Disease and Injury become negatively correlated: if a patient is not injured, disease is the more likely explanation. This effect is known as explaining away, and it is the mechanism behind collider bias.
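
A minimal sketch of the hospitalization example, computed by exact enumeration. All probabilities here are invented for illustration, as are the helper names `joint`, `prob`, and `cond`; only the qualitative pattern matters.

```python
import itertools

# Hypothetical numbers: Disease and Injury are independent, each with prior 0.1.
p_disease = 0.1
p_injury = 0.1
# Hypothetical P(Hospitalized = 1 | Disease, Injury).
p_hosp = {(0, 0): 0.01, (1, 0): 0.80, (0, 1): 0.80, (1, 1): 0.95}


def joint(d, i, h):
    """P(D, I, H) = P(D) * P(I) * P(H | D, I), the collider factorization."""
    pd = p_disease if d else 1.0 - p_disease
    pi = p_injury if i else 1.0 - p_injury
    ph = p_hosp[(d, i)] if h else 1.0 - p_hosp[(d, i)]
    return pd * pi * ph


def prob(**fixed):
    """Sum the joint over all assignments consistent with the fixed values."""
    total = 0.0
    for d, i, h in itertools.product((0, 1), repeat=3):
        assignment = {"d": d, "i": i, "h": h}
        if all(assignment[name] == value for name, value in fixed.items()):
            total += joint(d, i, h)
    return total


def cond(query, given):
    """P(query | given) for dictionaries of variable assignments."""
    return prob(**query, **given) / prob(**given)


# Marginally, Injury carries no information about Disease.
p_d_given_i0 = cond({"d": 1}, {"i": 0})
p_d_given_i1 = cond({"d": 1}, {"i": 1})

# Among hospitalized patients, it does: ruling out injury makes disease likelier.
p_d_given_h = cond({"d": 1}, {"h": 1})
p_d_given_h_i0 = cond({"d": 1}, {"h": 1, "i": 0})
p_d_given_h_i1 = cond({"d": 1}, {"h": 1, "i": 1})
```

With these made-up numbers, the probability of disease among hospitalized patients is about 0.50, rising to about 0.90 when injury is ruled out and falling to about 0.12 when injury is confirmed.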

Markov Random Fields: Undirected Graphical Models

MRFs encode potentials over cliques of connected variables, most commonly pairwise potentials on edges. Edges have no direction, which suits symmetric relationships such as pixel neighborhoods in images or word co-occurrences in text. The joint distribution is a product of non-negative potential functions over cliques, normalized by the partition function Z. Computing Z requires summing over every joint configuration, which is intractable for general graphs.
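
To make the cost of the partition function concrete, here is a small sketch of a pairwise MRF on a chain of binary variables, with Z computed by brute force. The Ising-style potential exp(coupling * x_i * x_j) and the names `unnormalized` and `partition_function` are assumptions chosen for illustration.

```python
import itertools
import math


def unnormalized(x, coupling):
    """Product of pairwise potentials exp(coupling * x_i * x_{i+1}) along a chain."""
    return math.exp(coupling * sum(a * b for a, b in zip(x, x[1:])))


def partition_function(n, coupling):
    """Brute-force Z: a sum over all 2**n joint states, feasible only for small n."""
    states = itertools.product((-1, 1), repeat=n)
    return sum(unnormalized(x, coupling) for x in states)


n, coupling = 8, 0.5
Z = partition_function(n, coupling)

# A positive coupling rewards neighbors that agree.
aligned = (1,) * n
alternating = tuple((-1) ** i for i in range(n))
p_aligned = unnormalized(aligned, coupling) / Z
p_alternating = unnormalized(alternating, coupling) / Z
```

Eight variables already need 256 terms, and each extra variable doubles the count. Chains and trees admit efficient exact algorithms, but grids and densely connected graphs generally do not, which is why approximate inference is common for MRFs.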

Where MRFs appear in ML: CRFs (Conditional Random Fields) are discriminative MRFs that model P(labels | inputs) directly and are used for sequence labeling (NER, POS tagging). The transition matrix in a linear-chain CRF scores how compatible each label is with the label before it.
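
The role of the transition matrix is easiest to see in decoding. Below is a minimal Viterbi sketch for a linear-chain CRF, assuming per-position emission scores and a label-to-label transition score matrix are already available (in practice both would come from a trained model). The label set and all scores are hypothetical.

```python
import numpy as np


def viterbi(emissions, transitions):
    """Highest-scoring label sequence for a linear-chain CRF.

    emissions:   (T, K) array of per-position label scores
    transitions: (K, K) array, transitions[i, j] = score of label i followed by j
    """
    T, K = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # candidate[i, j]: best path ending in i at t-1, then i -> j, then emit j at t
        candidate = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)
    # Follow back-pointers from the best final label.
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1], float(score.max())


# Hypothetical BIO-style labels: 0 = O, 1 = B, 2 = I.
emissions = np.array([
    [2.0, 0.0, 0.0],   # position 0 prefers O
    [0.0, 1.0, 1.5],   # position 1 locally prefers I over B
    [0.0, 0.0, 2.0],   # position 2 prefers I
])
transitions = np.zeros((3, 3))
transitions[0, 2] = -5.0  # penalize O -> I, which is invalid in BIO tagging

best_path, best_score = viterbi(emissions, transitions)
greedy_path = emissions.argmax(axis=1).tolist()
```

Picking the best label at each position independently gives O, I, I, which contains the penalized O → I step. Viterbi accounts for the transition scores and returns O, B, I instead.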

Latent Variable Models

Latent variable models assume observed data X is generated from unobserved (latent) variables Z. Learning requires marginalizing over Z, that is, summing or integrating over all possible values of the latent variables. Outside of simple models, exact marginalization is intractable.

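Gaussian Mixture Model: Z is discrete (which cluster), X is Gaussian given Z. The posterior over Z has a closed form, so EM applies with an exact E-step, although it converges only to a local optimum.

LDA (Latent Dirichlet Allocation): Z is a topic assignment per word. Approximate inference via variational EM.

VAE: Z is a continuous latent code. Inference via amortized variational inference (the encoder network). The training loss is the negative ELBO = reconstruction loss + KL(q(z|x) || p(z)), where q(z|x) is the encoder's approximate posterior and p(z) is the prior.

EM algorithm: for models where the E-step (computing expected sufficient statistics under the posterior) is tractable. A VAE's neural decoder makes the posterior intractable, so EM does not apply directly, hence amortized VI.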
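
As a rough illustration of the VAE objective, the sketch below evaluates a negative ELBO for a diagonal Gaussian approximate posterior and a standard normal prior, where the KL term has a closed form. It assumes a squared-error reconstruction term (a unit-variance Gaussian decoder, up to an additive constant); the function names and this choice of decoder are assumptions, not something fixed by the VAE framework.

```python
import numpy as np


def gaussian_kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)


def negative_elbo(x, x_recon, mu, log_var):
    """Reconstruction loss plus KL term, one value per example.

    Assumes a unit-variance Gaussian decoder, so the reconstruction loss is
    half the squared error (the additive constant is dropped).
    """
    recon = 0.5 * np.sum((x - x_recon) ** 2, axis=-1)
    return recon + gaussian_kl_to_standard_normal(mu, log_var)


# Hypothetical encoder outputs for two examples with a 2-D latent code.
mu = np.array([[0.0, 0.0], [1.0, 0.0]])
log_var = np.zeros((2, 2))
kl = gaussian_kl_to_standard_normal(mu, log_var)

# Hypothetical inputs and reconstructions.
x = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
x_recon = np.array([[1.0, 2.0, 2.0], [0.0, 0.0, 0.0]])
loss = negative_elbo(x, x_recon, mu, log_var)
```

The KL term is zero exactly when the approximate posterior matches the prior and grows as the encoder moves away from it, which is what makes it act as a regularizer on the latent code.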

The EM Algorithm

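For latent variable models where the posterior P(Z|X) is tractable, EM converges to a local maximum (more generally, a stationary point) of the marginal likelihood P(X|θ), and no iteration decreases that likelihood. E-step: compute Q(Z) = P(Z|X, θ_old). M-step: update θ to maximize E_Q[log P(X,Z|θ)]. Repeat until the likelihood stops improving.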

# EM for Gaussian Mixture Model (sketch)
# E-step: compute responsibilities r_nk = P(z_k | x_n, params)
# r_nk = π_k * N(x_n | μ_k, Σ_k) / (Σ_j π_j * N(x_n | μ_j, Σ_j))
# (Σ_j and Σ_n denote sums; Σ_k with a cluster index is a covariance matrix)

# M-step: update parameters using weighted MLE
# N_k = Σ_n r_nk  (effective number of points in cluster k)
# π_k = N_k / N
# μ_k = (Σ_n r_nk * x_n) / N_k
# Σ_k = (Σ_n r_nk * (x_n - μ_k)(x_n - μ_k)^T) / N_k
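
The update equations above translate fairly directly into NumPy. The sketch below simplifies to one-dimensional data, so each covariance Σ_k becomes a scalar variance; the function names, the synthetic data, and the initial values are assumptions for illustration.

```python
import numpy as np


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian, broadcast over arrays."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


def em_gmm_1d(x, means, variances, weights, n_iter=50):
    """EM for a 1-D Gaussian mixture.

    Returns the fitted parameters and the log-likelihood measured at the
    start of each iteration.
    """
    means = np.asarray(means, dtype=float).copy()
    variances = np.asarray(variances, dtype=float).copy()
    weights = np.asarray(weights, dtype=float).copy()
    log_liks = []
    for _ in range(n_iter):
        # E-step: responsibilities r[n, k] proportional to pi_k * N(x_n | mu_k, var_k)
        weighted = weights * normal_pdf(x[:, None], means, variances)  # shape (N, K)
        evidence = weighted.sum(axis=1, keepdims=True)
        r = weighted / evidence
        log_liks.append(float(np.log(evidence).sum()))
        # M-step: weighted maximum-likelihood updates
        n_k = r.sum(axis=0)  # effective number of points per cluster
        weights = n_k / len(x)
        means = (r * x[:, None]).sum(axis=0) / n_k
        variances = (r * (x[:, None] - means) ** 2).sum(axis=0) / n_k
    return means, variances, weights, log_liks


# Synthetic data from two well-separated clusters (hypothetical parameters).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 700)])

means, variances, weights, log_liks = em_gmm_1d(
    x, means=[-1.0, 1.0], variances=[1.0, 1.0], weights=[0.5, 0.5]
)
```

The recorded log-likelihoods should never decrease from one iteration to the next, which makes them a useful debugging check. A production implementation would also work in log space and guard against a component's variance collapsing toward zero.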

When Applied Scientist Interviews Probe This

'Explain LDA and why it requires approximate inference.' → Dirichlet-Categorical model, posterior over topic assignments is intractable due to coupling across words, requires variational EM or Gibbs sampling.

'Why does the VAE use the reparameterization trick?' → To backpropagate through the sampling operation z ~ q(z|x). Sample z = μ + σ·ε, ε ~ N(0, I). The randomness is moved to ε, which doesn't depend on the parameters, so gradients can flow through μ and σ.

'When would you use a PGM over a neural network?' → When you have known conditional independence structure (encode it), when you need interpretable latent variables, when training data is scarce and you have strong prior knowledge about structure.
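
The reparameterization answer can be checked numerically without an autodiff library. The sketch below uses a toy objective f(z) = z² (an assumption chosen because its expected value has known gradients) and estimates the gradients of E[f(z)] with respect to μ and σ by differentiating through z = μ + σ·ε by hand.

```python
import numpy as np


def reparam_gradients(mu, sigma, n_samples=200_000, seed=0):
    """Monte Carlo gradients of E[z**2] for z ~ N(mu, sigma**2) via reparameterization."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n_samples)  # noise that does not depend on mu or sigma
    z = mu + sigma * eps                  # deterministic, differentiable function of mu, sigma
    # f(z) = z**2, so df/dz = 2z; chain rule with dz/dmu = 1 and dz/dsigma = eps.
    grad_mu = float(np.mean(2.0 * z))
    grad_sigma = float(np.mean(2.0 * z * eps))
    return grad_mu, grad_sigma


mu, sigma = 1.5, 0.5
grad_mu, grad_sigma = reparam_gradients(mu, sigma)

# Analytic check: E[z**2] = mu**2 + sigma**2, so the gradients are 2*mu and 2*sigma.
exact_mu, exact_sigma = 2.0 * mu, 2.0 * sigma
```

In a VAE the same idea applies with f replaced by the decoder's log-likelihood and the hand-written chain rule replaced by automatic differentiation.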
