Foundations & Architecture 10 min read

Optimizers Explained: Adam's Second Moment, AdamW Decoupled Weight Decay, and Learning Rate Warmup

Why SGD loses to Adam on most deep learning tasks. Adam's first moment (momentum) + second moment (adaptive learning rates), AdamW's correction that decouples weight decay from gradient update, and why LR warmup prevents early catastrophic updates.

What Gradient Descent Is Actually Doing

Gradient descent updates parameters by moving in the direction opposite to the gradient of the loss. θ ← θ - η∇L(θ). The learning rate η controls step size. Too large: overshoot the minimum, loss oscillates or diverges. Too small: training takes forever and can get stuck in flat regions. The entire history of optimization research is the search for better ways to adapt this step.

SGD with Momentum

Vanilla SGD has high variance — each mini-batch gradient is a noisy estimate. Momentum accumulates a running average of gradients: v_t = β × v_{t-1} + (1-β) × ∇L, θ ← θ - η × v_t. With β=0.9, each update is a weighted average of the last ~10 gradients. This smooths out noise and accelerates convergence in directions of consistent gradient. The physical intuition: a ball rolling downhill accumulates momentum, overshoots flat regions, and dampens oscillations in high-curvature dimensions.

Adam: Adaptive Moment Estimation

Adam maintains per-parameter learning rates by tracking both the first moment (mean gradient) and second moment (uncentered variance). m_t = β₁ m_{t-1} + (1-β₁) g_t (first moment). v_t = β₂ v_{t-1} + (1-β₂) g_t² (second moment). Bias-corrected: m̂_t = m_t/(1-β₁^t), v̂_t = v_t/(1-β₂^t). Update: θ ← θ - η × m̂_t / (√v̂_t + ε).

The key insight: parameters that receive large, frequent gradients (e.g., common word embeddings in NLP) get smaller effective learning rates. Parameters that receive small, rare gradients get larger effective learning rates. This adaptive behavior makes Adam work well out of the box across many architectures and data types — it's the reason it became the default.

# Adam from scratch (simplified)
def adam_update(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # first moment
    v = beta2 * v + (1 - beta2) * grad**2        # second moment
    m_hat = m / (1 - beta1**t)                   # bias correction
    v_hat = v / (1 - beta2**t)                   # bias correction
    param -= lr * m_hat / (torch.sqrt(v_hat) + eps)
    return param, m, v

# Default hyperparameters: lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8
# These work for most cases. When they don't:
# - lr too high: loss spikes or NaN. Try 3e-4.
# - beta2 too high: slow adaptation to changing loss landscape. Try 0.99.
# - eps too small: numerical instability on sparse gradients. Try 1e-6.

AdamW: The Modern Standard

Standard Adam implements L2 regularization incorrectly. L2 reg adds λθ to the gradient, but Adam's adaptive scaling means the effective regularization is different for each parameter. AdamW decouples weight decay from the gradient update: θ ← θ × (1 - η × λ) - η × adam_update. This is the standard for transformer training — GPT, BERT, LLaMA all use AdamW.

When SGD Beats Adam

Despite Adam's practical dominance, SGD with momentum + learning rate schedule often achieves better final generalization on image classification tasks (ResNet, VGG). Hypothesis: Adam's adaptive learning rates can find sharp minima that generalize poorly, while SGD's uniform scaling finds flatter minima that generalize better. In practice: for new architectures or tasks, start with Adam. If generalization is poor, try SGD + cosine LR schedule. For transformers: always AdamW.

Learning Rate Schedules

Warmup + cosine decay: start with a small LR, ramp up linearly for N steps, then decay following a cosine curve to near-zero. Standard for transformers. Warmup prevents early instability when gradients are large and model is far from optimum.
One-cycle policy: LR increases then decreases within a single training run. Often achieves good results in fewer epochs (super-convergence).
Reduce on plateau: monitor validation loss, reduce LR by factor (e.g. 0.5) when loss stops improving. Simple but effective for CNNs.
Constant LR: sometimes the right answer, especially for fine-tuning. Avoids the complexity of schedule tuning.

Try it interactively

GenAI Systems Lab is a free platform for AI engineers — configure real failure modes, break things, and build the judgment that gets you hired.

Open GenAI Systems Lab →