Loss Functions Deep Dive: MSE, BCE, Focal Loss, Contrastive, and Triplet

Why the loss function is an architectural decision. MSE vs MAE for regression outliers, BCE vs MSE for classification, Focal Loss for class imbalance, contrastive loss for similarity learning, triplet loss for ranking, with the math that interviewers actually ask about.

Loss Functions Are Assumptions About Your Data

A loss function is not an arbitrary choice. It encodes a specific assumption about the distribution of your errors and what kinds of mistakes you're willing to make. Choosing the wrong loss function doesn't just slow training — it trains a model that's wrong in a systematic way that your metrics might not catch. Applied Scientist interviews go deep here because loss function choice reveals whether you understand what your model is actually learning.

Mean Squared Error (MSE)

MSE = (1/n) Σ(y_i - ŷ_i)². Assumption: errors are normally distributed with constant variance, so minimizing MSE is maximum-likelihood estimation under Gaussian noise and the optimal prediction is the conditional mean. Consequence: outliers have quadratic impact. A prediction error of 10 contributes 100x as much to the loss as an error of 1. This means MSE trains models that are pulled toward outliers. Use MSE when outliers are real signal (a $10,000 transaction in fraud detection is important) and when large errors should cost disproportionately more than small ones.

MAE = (1/n) Σ|y_i - ŷ_i|. Assumption: errors are Laplace-distributed, and the optimal prediction is the conditional median. Consequence: all errors contribute linearly, so an error of 10 contributes 10x as much as an error of 1. Use MAE when outliers are noise (a mislabeled data point shouldn't dominate training). MAE is not differentiable at 0, so optimizers fall back on a subgradient there. Huber loss is quadratic like MSE for errors up to a threshold δ and linear like MAE beyond it, which keeps it differentiable at 0 while limiting the pull of outliers.
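To make the outlier behavior concrete, here is a minimal sketch in plain Python (no framework, so the arithmetic stays visible). The function names and the toy data are illustrative assumptions, and the Huber form used is the common one: 0.5·r² for |r| ≤ δ and δ·(|r| - 0.5·δ) beyond it.

```python
def mse(y_true, y_pred):
    """Mean squared error: each residual is squared before averaging."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: each residual contributes linearly."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def huber(y_true, y_pred, delta=1.0):
    """Huber loss: quadratic for |residual| <= delta, linear beyond it."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        r = abs(t - p)
        if r <= delta:
            total += 0.5 * r ** 2
        else:
            total += delta * (r - 0.5 * delta)
    return total / len(y_true)


# Hypothetical toy data: same targets, two sets of predictions.
y_true = [1.0, 2.0, 3.0, 4.0]
clean = [1.5, 2.5, 3.5, 4.5]     # every residual is 0.5
outlier = [1.5, 2.5, 3.5, 14.0]  # last residual is 10

for name, fn in [("MSE", mse), ("MAE", mae), ("Huber", huber)]:
    print(f"{name}: clean={fn(y_true, clean):.4f} outlier={fn(y_true, outlier):.4f}")
```

With a single residual of 10, MSE jumps from 0.25 to about 25.19, MAE moves from 0.5 to 2.875, and Huber (δ = 1) lands close to MAE at about 2.47.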

Binary Cross-Entropy

BCE = -(y log(p) + (1-y) log(1-p)). This is the negative log-likelihood under a Bernoulli distribution: you're assuming each label is independently drawn from a Bernoulli with parameter p. When y=1 and p→0, the loss → ∞, so the model is severely penalized for being confidently wrong. When y=1 and p→1, loss → 0. Hedging does not pay either: always outputting p=0.5 costs log 2 ≈ 0.693 per example, so the loss only approaches zero through confident, correct predictions.

import torch
import torch.nn.functional as F

# Binary case: one logit per example, mapped to a probability with sigmoid
# (softmax is for mutually exclusive multiclass outputs)
logits = torch.tensor([2.0, -1.0, 0.5])  # raw model outputs
probs = torch.sigmoid(logits)  # [0.88, 0.27, 0.62]
labels = torch.tensor([1.0, 0.0, 1.0])

# Numerically stable version (BCEWithLogitsLoss)
loss = F.binary_cross_entropy_with_logits(logits, labels)
# Equivalent to: -mean(y*log(σ(x)) + (1-y)*log(1-σ(x)))

# For multiclass: softmax + cross entropy
logits_multi = torch.tensor([[2.0, 1.0, 0.1]])
labels_multi = torch.tensor([0])  # true class is 0
loss_multi = F.cross_entropy(logits_multi, labels_multi)
# cross_entropy = NLLLoss(LogSoftmax(logits))
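Why compute the loss from logits rather than from probabilities? The sketch below, in plain Python, contrasts the textbook formula with a standard stable rewrite, max(x, 0) - x·y + log(1 + exp(-|x|)). This is a well-known identity and is shown for illustration only; it is not claimed to mirror PyTorch's internal code, and the helper names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_from_prob(p, y):
    """Textbook BCE on a probability; breaks when p rounds to exactly 0 or 1."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def bce_from_logit(x, y):
    """Algebraically equal to bce_from_prob(sigmoid(x), y), but never takes log(0)."""
    return max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))


# Same logits and labels as the PyTorch snippet above.
logits = [2.0, -1.0, 0.5]
labels = [1.0, 0.0, 1.0]

naive = [bce_from_prob(sigmoid(x), y) for x, y in zip(logits, labels)]
stable = [bce_from_logit(x, y) for x, y in zip(logits, labels)]
mean_loss = sum(stable) / len(stable)
print(naive, stable, mean_loss)

# A confidently wrong logit: sigmoid(40) rounds to 1.0 in float64,
# so the textbook form hits log(0) while the logit form returns about 40.
try:
    bce_from_prob(sigmoid(40.0), 0.0)
    naive_failed = False
except ValueError:
    naive_failed = True

extreme = bce_from_logit(40.0, 0.0)
print(naive_failed, extreme)
```

For moderate logits the two forms agree to floating-point precision; they only diverge where the sigmoid saturates, which is exactly where a confidently wrong model needs a usable loss value.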

Focal Loss: When Classes Are Imbalanced

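In fraud detection, 99.9% of transactions are legitimate. With standard cross-entropy, the many easy negatives dominate the summed loss and gradient, so the model can drift toward always predicting 'not fraud', which achieves 99.9% accuracy while being useless. Focal Loss (Lin et al., 2017) = -(1-p_t)^γ × log(p_t), where p_t is the predicted probability of the true class. The factor (1-p_t)^γ downweights easy examples. When γ=0, this is standard cross-entropy. When γ=2, an example the model already classifies correctly with p_t = 0.9 has its loss scaled by (1-0.9)² = 0.01, so it contributes 100x less than it would under cross-entropy, while a hard example with p_t = 0.1 keeps 0.81 of its loss. The model focuses on hard examples: the ones it misclassifies or that sit near the decision boundary.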
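The modulating factor is easy to check by hand. The sketch below implements the formula exactly as written above for a single example in plain Python. The function name is illustrative, and the α class-balancing weight that the paper also describes is deliberately left out.

```python
import math


def focal_loss(p, y, gamma=2.0):
    """Focal loss for one example: -(1 - p_t)^gamma * log(p_t).

    p is the predicted probability of the positive class, y is 0 or 1.
    """
    p_t = p if y == 1 else 1.0 - p
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# gamma=0 recovers plain cross-entropy, so it serves as the baseline.
easy_ce = focal_loss(0.9, 1, gamma=0.0)  # confident and correct
easy_fl = focal_loss(0.9, 1, gamma=2.0)
hard_ce = focal_loss(0.1, 1, gamma=0.0)  # confident and wrong
hard_fl = focal_loss(0.1, 1, gamma=2.0)

print(f"easy example keeps {easy_fl / easy_ce:.2f} of its cross-entropy loss")
print(f"hard example keeps {hard_fl / hard_ce:.2f} of its cross-entropy loss")
```

The easy example keeps about 1% of its loss and the hard one about 81%, so hard examples carry relatively more of the total loss and gradient.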

Contrastive and Triplet Loss

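Used for metric learning, when you want embeddings to be close for similar items and far apart for dissimilar items. Contrastive loss: L = (1-y) × (1/2) × d² + y × (1/2) × max(0, margin - d)², where d is the distance between the two embeddings, y=0 for similar pairs, and y=1 for dissimilar pairs. Similar pairs are pulled together; dissimilar pairs are pushed apart until they are at least the margin apart, after which they contribute nothing. Triplet loss: L = max(0, d(anchor, positive) - d(anchor, negative) + margin). The loss reaches zero only once the negative is farther from the anchor than the positive by at least the margin. Losses in this family are used to train sentence-embedding models such as SBERT and two-tower recommendation models.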
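Both formulas can be traced on a few 2-D points. This is a minimal sketch assuming Euclidean distance and hand-picked embeddings. The margins and point values are illustrative assumptions, and real training would compute the losses over batches of learned embeddings.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(a, b, y, margin=2.0):
    """y=0 for a similar pair, y=1 for a dissimilar pair (same convention as above)."""
    d = euclidean(a, b)
    return (1 - y) * 0.5 * d ** 2 + y * 0.5 * max(0.0, margin - d) ** 2


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is at least `margin` farther from the anchor than the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Hypothetical embeddings.
anchor = [0.0, 0.0]
positive = [0.3, 0.4]       # distance 0.5 from the anchor
near_negative = [0.6, 0.8]  # distance 1.0: still inside the margin
far_negative = [3.0, 4.0]   # distance 5.0: already far enough

print(contrastive_loss(anchor, positive, y=0))       # similar pair, pulled together
print(contrastive_loss(anchor, near_negative, y=1))  # dissimilar pair inside the margin
print(contrastive_loss(anchor, far_negative, y=1))   # dissimilar pair beyond the margin
print(triplet_loss(anchor, positive, near_negative))
print(triplet_loss(anchor, positive, far_negative))
```

The far negative contributes zero to both losses, which is why the choice of negatives matters in practice: triplets that already satisfy the margin produce no gradient.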

The Applied Scientist Question Pattern

'Your regression model has high MSE but low MAE. What does this tell you?' → A few large errors are pulling MSE up. The typical prediction is good, but the model is very wrong on some edge cases.
'You're training a CTR model on imbalanced data (1% CTR). What loss do you use?' → Focal loss, or reweight BCE (positive weight = negative count / positive count), or undersample negatives.
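'Why does cross-entropy work better than MSE for classification?' → With a sigmoid output (softmax behaves similarly), the MSE gradient with respect to the logit carries a factor p(1-p), which vanishes when the output saturates near 0 or 1, even when the prediction is confidently wrong. The cross-entropy gradient with respect to the logit is p - y, so it stays proportional to the prediction error at all values.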
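'What loss would you use for learning to rank?' → Pairwise: BPR or RankNet. Listwise: ListNet, or LambdaMART (strictly a gradient-boosted tree model trained with LambdaRank gradients rather than a loss in its own right). Pointwise: can work but ignores rank structure.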
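The gradient argument in the cross-entropy versus MSE answer above can be checked numerically. The sketch below assumes a single sigmoid output p = σ(z) and compares the gradient of each loss with respect to the logit z for a positive label. The function names are illustrative, and the MSE here is the per-example (p - y)².

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_bce_wrt_logit(z, y):
    """d/dz of BCE(sigmoid(z), y): simplifies to p - y."""
    return sigmoid(z) - y


def grad_mse_wrt_logit(z, y):
    """d/dz of (sigmoid(z) - y)^2: the chain rule adds a p * (1 - p) factor."""
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)


# The true label is 1; z = -10 is a confidently wrong prediction.
for z in (-10.0, -2.0, 0.0):
    print(f"z={z:6.1f}  p={sigmoid(z):.5f}  "
          f"dBCE/dz={grad_bce_wrt_logit(z, 1.0):+.5f}  "
          f"dMSE/dz={grad_mse_wrt_logit(z, 1.0):+.5f}")
```

At z = -10 the cross-entropy gradient has magnitude close to 1, while the MSE gradient is below 1e-3: the worse the saturated prediction, the less MSE pushes back.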
