The Training Loop From Scratch: Forward, Backward, Gradient Descent in NumPy
Forward pass, loss, backpropagation by hand, weight update. No PyTorch. No autograd. The chain rule applied layer by layer — the same computation PyTorch automates. Build it once and every framework makes sense.
Every neural network trains through the same loop: forward pass, compute loss, backward pass, update weights. PyTorch automates the backward pass. But if you do not understand what happens before the automation, you cannot debug training failures, interpret learning curves, or make informed decisions about optimisers. This post implements the full loop in NumPy: no autograd, no magic.
The forward pass
For a two-layer MLP: Z1 = X·W1 + b1, A1 = ReLU(Z1), output = A1·W2 + b2. The loss is mean squared error: L = mean((output - y)^2). Everything up to this point is plain computation. The gradient of L with respect to each parameter then tells you how to change that parameter to decrease the loss: step against the gradient.
import numpy as np

np.random.seed(0)

# Toy regression data: noisy samples of sin(x)
X = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
y = np.sin(X) + 0.1 * np.random.randn(*X.shape)

# Two-layer MLP with He-style initialisation
d_in, d_h, d_out = 1, 64, 1
W1 = np.random.randn(d_in, d_h) * np.sqrt(2 / d_in)
b1 = np.zeros((1, d_h))
W2 = np.random.randn(d_h, d_out) * np.sqrt(2 / d_h)
b2 = np.zeros((1, d_out))

lr = 0.01
losses = []

for step in range(2001):
    # Forward
    Z1 = X @ W1 + b1
    A1 = np.maximum(0, Z1)  # ReLU
    out = A1 @ W2 + b2
    loss = np.mean((out - y) ** 2)
    losses.append(loss)

    # Backward: chain rule by hand
    N = X.shape[0]
    d_out_layer = 2 * (out - y) / N  # dL/d(out)
    dW2 = A1.T @ d_out_layer
    db2 = d_out_layer.sum(axis=0, keepdims=True)
    d_A1 = d_out_layer @ W2.T
    d_Z1 = d_A1 * (Z1 > 0)  # ReLU gradient
    dW1 = X.T @ d_Z1
    db1 = d_Z1.sum(axis=0, keepdims=True)

    # Gradient descent update
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

    if step % 500 == 0:
        print(f"Step {step:5d} loss={loss:.6f}")
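The backward lines in the loop follow from the chain rule applied to the forward equations. Written out as a sketch, with N the number of samples, Ŷ the network output, ⊙ the elementwise product and 1[·] the indicator function (this notation is introduced here for illustration and does not appear in the code):

```latex
\begin{aligned}
L &= \frac{1}{N}\sum_{n=1}^{N} (\hat{y}_n - y_n)^2, \qquad \hat{Y} = A_1 W_2 + b_2 \\
\frac{\partial L}{\partial \hat{Y}} &= \frac{2}{N}\,(\hat{Y} - Y) \\
\frac{\partial L}{\partial W_2} &= A_1^{\top}\,\frac{\partial L}{\partial \hat{Y}}, \qquad
\frac{\partial L}{\partial b_2} = \sum_{n}\left(\frac{\partial L}{\partial \hat{Y}}\right)_{n} \\
\frac{\partial L}{\partial A_1} &= \frac{\partial L}{\partial \hat{Y}}\,W_2^{\top} \\
\frac{\partial L}{\partial Z_1} &= \frac{\partial L}{\partial A_1} \odot \mathbf{1}[Z_1 > 0] \\
\frac{\partial L}{\partial W_1} &= X^{\top}\,\frac{\partial L}{\partial Z_1}, \qquad
\frac{\partial L}{\partial b_1} = \sum_{n}\left(\frac{\partial L}{\partial Z_1}\right)_{n}
\end{aligned}
```

Each line maps onto one variable in the loop: d_out_layer, dW2 and db2, d_A1, d_Z1, then dW1 and db1. The 1/N factor matches np.mean only because d_out = 1 here; with several outputs, np.mean would divide by N·d_out instead.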
What each gradient means
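dL/dW2 tells you: if you increase W2[i,j] by a small epsilon, the loss changes by approximately dW2[i,j] * epsilon (a first-order approximation). The ReLU backward is d_Z1 = d_A1 * (Z1 > 0): the gradient is zero wherever Z1 was not positive and passes through unchanged wherever Z1 was positive. A unit whose Z1 is negative for every input is a dead neuron: it receives no gradient at all, so its weights stop updating. This is why initialisation matters: a bad random initialisation can leave many neurons dead from step 1, and they may never recover.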
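The "increase W2[i,j] by epsilon" reading can be tested directly with a finite-difference gradient check. The sketch below uses a smaller network than the one above (sizes, seed and the helper names loss_fn, analytic_dW2 and numeric_dW2 are illustrative choices, not part of the original code) and also reports the fraction of hidden units that are inactive on every sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small problem so the check runs instantly (sizes are arbitrary).
X = rng.normal(size=(20, 1))
y = np.sin(X)
W1 = rng.normal(size=(1, 8))
b1 = rng.normal(size=(1, 8)) * 0.1
W2 = rng.normal(size=(8, 1))
b2 = np.zeros((1, 1))

def loss_fn(W2_candidate):
    """MSE loss as a function of W2 only; the other parameters stay fixed."""
    A1 = np.maximum(0, X @ W1 + b1)
    out = A1 @ W2_candidate + b2
    return np.mean((out - y) ** 2)

def analytic_dW2():
    """Same formula as the backward pass in the training loop."""
    A1 = np.maximum(0, X @ W1 + b1)
    out = A1 @ W2 + b2
    d_out_layer = 2 * (out - y) / X.shape[0]
    return A1.T @ d_out_layer

def numeric_dW2(eps=1e-6):
    """Central differences: nudge one entry of W2 at a time."""
    grad = np.zeros_like(W2)
    for i in range(W2.shape[0]):
        for j in range(W2.shape[1]):
            W_plus = W2.copy()
            W_plus[i, j] += eps
            W_minus = W2.copy()
            W_minus[i, j] -= eps
            grad[i, j] = (loss_fn(W_plus) - loss_fn(W_minus)) / (2 * eps)
    return grad

max_abs_diff = np.max(np.abs(analytic_dW2() - numeric_dW2()))
print(f"max |analytic - numeric| = {max_abs_diff:.2e}")

# A unit is dead on this dataset if its pre-activation is <= 0 for every sample.
dead_fraction = np.mean((X @ W1 + b1 <= 0).all(axis=0))
print(f"fraction of dead hidden units = {dead_fraction:.2f}")
```

If the two gradients disagree by much more than rounding error, the hand-written backward pass has a bug. The same check works for W1, b1 and b2 by perturbing those arrays instead.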
Why PyTorch's autograd does the same thing
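torch.tensor(..., requires_grad=True) marks a tensor so that PyTorch records a computation graph during the forward pass. Every operation stores how to compute its gradient. loss.backward() traverses this graph in reverse, applying the same chain rule as above, and leaves the result in param.grad. optimizer.step() then runs the weight update. Because gradients accumulate across backward() calls, training loops also call optimizer.zero_grad() each step. The mathematics is the same as in the NumPy code; only the derivative calculations are automated.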
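To make "every operation records how to compute its gradient" concrete, here is a toy reverse-mode autograd for scalars in plain Python. It is a sketch of the idea only, not how PyTorch is implemented; the Value class and its methods are hypothetical names invented for this example.

```python
class Value:
    """A scalar that remembers which operation produced it (toy autograd node)."""

    def __init__(self, data, parents=()):
        self.data = float(data)
        self.grad = 0.0
        self._parents = parents
        self._backward_fn = None  # pushes this node's grad into its parents

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))

        def backward_fn():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad

        out._backward_fn = backward_fn
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))

        def backward_fn():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward_fn = backward_fn
        return out

    def relu(self):
        out = Value(max(0.0, self.data), (self,))

        def backward_fn():
            # Gradient passes through only where the input was positive.
            self.grad += (1.0 if self.data > 0 else 0.0) * out.grad

        out._backward_fn = backward_fn
        return out

    def backward(self):
        # Topologically sort the graph, then apply the chain rule in reverse.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            if node._backward_fn is not None:
                node._backward_fn()


# One "neuron" and a squared-error loss: loss = (relu(w * x + b) - target)^2
w, x, b = Value(2.0), Value(3.0), Value(-1.0)
pred = (w * x + b).relu()   # relu(5) = 5
err = pred + (-4.0)         # target is 4, so err = 1
loss = err * err
loss.backward()
print(loss.data, w.grad, b.grad)  # 1.0 6.0 2.0
```

By hand, dL/dw = 2 · err · x = 2 · 1 · 3 = 6 and dL/db = 2 · err = 2, which matches what the graph traversal produces. Note the += in every backward function: gradients accumulate, which is why frameworks need an explicit reset between steps.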
Adam vs SGD: a better update rule
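SGD: W -= lr * dW. Adam keeps an exponential moving average of the gradients (m, the first moment) and of the squared gradients (v, the second moment), and corrects both for starting at zero: m_hat = m / (1 - beta1^t), v_hat = v / (1 - beta2^t). Update: W -= lr * m_hat / (sqrt(v_hat) + eps). Dividing by sqrt(v_hat) scales each parameter's step by its recent gradient magnitude, so parameters with consistently large gradients get smaller effective steps and parameters with small gradients get larger ones. The derivatives are the same chain-rule derivatives as above; only the step size per parameter is smarter.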
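A minimal NumPy sketch of one Adam step, assuming the commonly used defaults beta1 = 0.9, beta2 = 0.999 and eps = 1e-8. The function name adam_step and the toy objective are illustrative choices. The example compares the first update against plain SGD on two parameters whose gradients differ by four orders of magnitude.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad        # first moment: average gradient
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment: average squared gradient
    m_hat = m / (1 - beta1 ** t)              # undo the bias from m starting at zero
    v_hat = v / (1 - beta2 ** t)              # undo the bias from v starting at zero
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Toy objective f(w) = sum(scale * w**2) with very different curvature per coordinate.
scale = np.array([100.0, 0.01])
w = np.array([1.0, 1.0])
grad = 2 * scale * w                          # [200.0, 0.02]

lr = 0.01
sgd_step = lr * grad                          # SGD step is proportional to the gradient
w_adam, m, v = adam_step(w, grad, np.zeros_like(w), np.zeros_like(w), t=1, lr=lr)
adam_first_step = w - w_adam                  # Adam step is normalised per parameter

print("SGD step: ", sgd_step)
print("Adam step:", adam_first_step)
```

On the first step the bias-corrected ratio m_hat / sqrt(v_hat) is close to the sign of the gradient, so both coordinates move by roughly lr, while the SGD step on the steep coordinate is 10,000 times larger than on the flat one.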
Extend this to cross-entropy classification. The combined gradient of softmax followed by cross-entropy, taken with respect to the logits, is d_logits = softmax(out) - one_hot(y) for each sample (divide by N if the loss is averaged over the batch). This is one of the cleanest results in backpropagation: implement it and trace exactly where each term comes from.
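As a starting point, the sketch below implements the combined softmax and cross-entropy gradient for a batch-averaged loss and checks it against finite differences. The helper names (softmax, cross_entropy, d_logits_fn) and the random logits are illustrative assumptions, not code from the original post.

```python
import numpy as np

def softmax(logits):
    # Subtract the row max for numerical stability; the result is unchanged.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(logits, labels):
    """Mean negative log-probability of the correct class."""
    probs = softmax(logits)
    n = logits.shape[0]
    return -np.mean(np.log(probs[np.arange(n), labels]))

def d_logits_fn(logits, labels):
    """Gradient of the mean cross-entropy with respect to the logits."""
    n, k = logits.shape
    one_hot = np.eye(k)[labels]
    return (softmax(logits) - one_hot) / n

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 3))      # 5 samples, 3 classes
labels = np.array([0, 2, 1, 1, 0])

analytic = d_logits_fn(logits, labels)

# Finite-difference check, one logit at a time.
eps = 1e-6
numeric = np.zeros_like(logits)
for i in range(logits.shape[0]):
    for j in range(logits.shape[1]):
        plus = logits.copy()
        plus[i, j] += eps
        minus = logits.copy()
        minus[i, j] -= eps
        numeric[i, j] = (cross_entropy(plus, labels) - cross_entropy(minus, labels)) / (2 * eps)

max_abs_diff = np.max(np.abs(analytic - numeric))
print(f"max |analytic - numeric| = {max_abs_diff:.2e}")
```

Each row of the gradient sums to zero, because the softmax probabilities and the one-hot target both sum to one. In the training loop above, this array would take the place of d_out_layer, with the rest of the backward pass unchanged.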