
The Training Loop From Scratch: Forward, Backward, Gradient Descent in NumPy

Forward pass, loss, backpropagation by hand, weight update. No PyTorch. No autograd. The chain rule applied layer by layer — the same computation PyTorch automates. Build it once and every framework makes sense.

Every neural network trains through the same loop: forward pass, compute loss, backward pass, update weights. PyTorch automates the backward pass. But if you do not understand what happens before the automation, you cannot debug training failures, interpret learning curves, or make informed decisions about optimisers. This post implements the full loop in NumPy: no autograd, no magic.

The forward pass

For a two-layer MLP: Z1 = X·W1 + b1, A1 = ReLU(Z1), output = A1·W2 + b2. Mean squared error loss: L = mean((output - y)^2). Everything up to this point is plain computation. The gradient of L with respect to each parameter points in the direction of steepest increase, so stepping against it is what decreases the loss.

import numpy as np

np.random.seed(0)
X = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
y = np.sin(X) + 0.1 * np.random.randn(*X.shape)

d_in, d_h, d_out = 1, 64, 1
W1 = np.random.randn(d_in, d_h) * np.sqrt(2/d_in)
b1 = np.zeros((1, d_h))
W2 = np.random.randn(d_h, d_out) * np.sqrt(2/d_h)
b2 = np.zeros((1, d_out))
lr = 0.01
losses = []

for step in range(2001):
    # Forward
    Z1  = X @ W1 + b1
    A1  = np.maximum(0, Z1)          # ReLU
    out = A1 @ W2 + b2
    loss = np.mean((out - y) ** 2)
    losses.append(loss)

    # Backward — chain rule by hand
    N = X.shape[0]
    d_out_layer = 2 * (out - y) / N
    dW2 = A1.T @ d_out_layer
    db2 = d_out_layer.sum(axis=0, keepdims=True)
    d_A1 = d_out_layer @ W2.T
    d_Z1 = d_A1 * (Z1 > 0)          # ReLU gradient
    dW1 = X.T @ d_Z1
    db1 = d_Z1.sum(axis=0, keepdims=True)

    # Gradient descent update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

    if step % 500 == 0:
        print(f"Step {step:5d}  loss={loss:.6f}")

What each gradient means

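dL/dW2 tells you: if you increase W2[i,j] by a small epsilon, the loss changes by approximately dW2[i,j] * epsilon (a first-order estimate). The ReLU backward is d_Z1 = d_A1 * (Z1 > 0): the gradient is zero wherever Z1 was not positive and passes through unchanged wherever Z1 was positive. A hidden unit whose Z1 is negative for every input is a dead neuron: it receives zero gradient, so its weights stop updating. This is why initialisation matters: a bad random initialisation can leave many neurons dead from step 1, and in this network they have no way to recover.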
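A quick way to test the "loss changes by dW2[i,j] * epsilon" reading is a finite-difference check. The sketch below rebuilds a smaller copy of the network and compares the hand-derived dW2 with a central difference. The sizes, the seed, and the helper names forward_loss and analytic_dW2 are illustrative choices, not part of the training script above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 1))
y = np.sin(X)

# Small illustrative network (sizes are assumptions for this check)
d_in, d_h, d_out = 1, 8, 1
W1 = rng.normal(size=(d_in, d_h)) * np.sqrt(2 / d_in)
b1 = np.zeros((1, d_h))
W2 = rng.normal(size=(d_h, d_out)) * np.sqrt(2 / d_h)
b2 = np.zeros((1, d_out))

def forward_loss(W1, b1, W2, b2):
    """Forward pass plus MSE loss, same formulas as the training loop."""
    A1 = np.maximum(0, X @ W1 + b1)
    out = A1 @ W2 + b2
    return np.mean((out - y) ** 2)

def analytic_dW2(W1, b1, W2, b2):
    """Hand-derived gradient of the loss with respect to W2."""
    A1 = np.maximum(0, X @ W1 + b1)
    out = A1 @ W2 + b2
    d_out_layer = 2 * (out - y) / X.shape[0]
    return A1.T @ d_out_layer

# Central difference: nudge one weight at a time and watch the loss
eps = 1e-6
numeric = np.zeros_like(W2)
for i in range(W2.shape[0]):
    for j in range(W2.shape[1]):
        W2_plus = W2.copy()
        W2_plus[i, j] += eps
        W2_minus = W2.copy()
        W2_minus[i, j] -= eps
        numeric[i, j] = (forward_loss(W1, b1, W2_plus, b2)
                         - forward_loss(W1, b1, W2_minus, b2)) / (2 * eps)

dW2 = analytic_dW2(W1, b1, W2, b2)
max_abs_diff = np.max(np.abs(dW2 - numeric))
print("max |analytic - numeric| =", max_abs_diff)
```

If the two disagree by more than floating-point noise, the backward pass has a bug. The same loop works for W1, b1, and b2 by swapping which parameter gets nudged.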

Why PyTorch's autograd does the same thing

Creating a tensor with torch.tensor(..., requires_grad=True) tells PyTorch to track every operation applied to it, so a computation graph is built during the forward pass. Every operation records how to compute its gradient. loss.backward() traverses this graph in reverse, applying the same chain rule as above. param.grad holds the result. optimizer.step() runs the weight update. The maths is the same as in the NumPy code; only the derivative calculations are automated.
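To make "records how to compute its gradient" concrete, here is a toy reverse-mode autograd for scalars in plain Python. It is a teaching sketch of the idea only: the Value class and its methods are hypothetical names, and PyTorch's real engine works on tensors and is implemented very differently.

```python
class Value:
    """A scalar that remembers which values produced it (hypothetical teaching class)."""

    def __init__(self, data, parents=()):
        self.data = float(data)
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None  # filled in by the op that creates this node

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def relu(self):
        out = Value(max(0.0, self.data), (self,))
        def _backward():
            # Same mask as d_Z1 = d_A1 * (Z1 > 0)
            self.grad += (self.data > 0) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph, then apply the chain rule in reverse
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward()

# One hidden unit: loss = (relu(x * w + b) * w2 - y) ** 2
x, w, b, w2, y = Value(0.5), Value(2.0), Value(0.1), Value(-1.5), Value(1.0)
a = (x * w + b).relu()
err = a * w2 + y * Value(-1.0)
loss = err * err
loss.backward()
print("loss =", loss.data, " dL/dw =", w.grad, " dL/dw2 =", w2.grad)
```

Working the chain rule by hand for these inputs gives dL/dw of about 3.975 and dL/dw2 of about -5.83, which is what the recorded _backward closures reproduce. Each closure plays the role of one line of the manual backward pass.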

Adam vs SGD: a better update rule

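SGD: W -= lr * dW. Adam maintains exponential moving averages of the gradients (m, first moment) and of the squared gradients (v, second moment). Update: W -= lr * m_hat / (sqrt(v_hat) + eps), where m_hat = m / (1 - beta1^t) and v_hat = v / (1 - beta2^t) correct the bias from starting m and v at zero. Because each step is divided by the root of the running squared gradient, parameters with consistently large gradients get relatively smaller steps and parameters with small gradients get relatively larger ones. Same chain-rule derivatives as above, just a smarter step size per parameter.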
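The Adam update fits in a few lines of NumPy. This is a minimal sketch with the commonly used defaults (beta1 = 0.9, beta2 = 0.999, eps = 1e-8) and the standard bias correction for m and v; the function name adam_step and the toy quadratic are assumptions made for illustration.

```python
import numpy as np

def adam_step(W, dW, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. t is the 1-based step count (hypothetical helper)."""
    m = beta1 * m + (1 - beta1) * dW          # running mean of gradients
    v = beta2 * v + (1 - beta2) * dW ** 2     # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction: m starts at zero
    v_hat = v / (1 - beta2 ** t)              # bias correction: v starts at zero
    W = W - lr * m_hat / (np.sqrt(v_hat) + eps)
    return W, m, v

# Toy loss f(W) = sum(scale * W**2): the two coordinates have very different gradients
scale = np.array([100.0, 0.01])
W = np.array([1.0, 1.0])
m = np.zeros_like(W)
v = np.zeros_like(W)

initial_loss = np.sum(scale * W ** 2)
first_step = None
for t in range(1, 501):
    dW = 2 * scale * W
    W_new, m, v = adam_step(W, dW, m, v, t)
    if t == 1:
        first_step = W_new - W
    W = W_new
final_loss = np.sum(scale * W ** 2)

print("first step per coordinate:", first_step)
print("loss before:", initial_loss, " loss after:", final_loss)
```

On the first step the gradients differ by a factor of 10,000, yet both coordinates move by roughly lr, because each gradient is divided by its own magnitude. Plain SGD with the same lr would take wildly different step sizes on the two coordinates.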

Extend this to cross-entropy classification. The combined gradient for cross-entropy+softmax with respect to the logits is d_logits = softmax(out) - one_hot(y) for each example (divide by N when the loss is averaged over the batch). This is one of the cleanest results in backpropagation. Implement it and trace exactly where each term comes from.
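As a starting point, the sketch below computes the softmax plus cross-entropy gradient on random logits and confirms it against a finite-difference estimate. The batch size, class count, and function names are illustrative assumptions, and the loss is averaged over the batch, hence the division by N.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C = 5, 3                                  # illustrative batch size and class count
logits = rng.normal(size=(N, C))
labels = rng.integers(0, C, size=N)
one_hot = np.eye(C)[labels]

def softmax(z):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(z):
    """Mean negative log-probability of the correct class."""
    p = softmax(z)
    return -np.mean(np.log(p[np.arange(N), labels]))

# Combined gradient of mean cross-entropy with respect to the logits
d_logits = (softmax(logits) - one_hot) / N

# Finite-difference check, one logit at a time
eps = 1e-6
numeric = np.zeros_like(logits)
for i in range(N):
    for j in range(C):
        z_plus = logits.copy()
        z_plus[i, j] += eps
        z_minus = logits.copy()
        z_minus[i, j] -= eps
        numeric[i, j] = (cross_entropy(z_plus) - cross_entropy(z_minus)) / (2 * eps)

max_abs_diff = np.max(np.abs(d_logits - numeric))
print("max |analytic - numeric| =", max_abs_diff)
```

In the regression loop above, d_logits would take the place of d_out_layer; the rest of the backward pass stays the same. Each row of d_logits sums to zero, since both the softmax probabilities and the one-hot target sum to one.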
