
RNNs, LSTMs, and the Vanishing Gradient: What Transformers Replaced

Recurrence, backpropagation through time, and why gradients collapse over long sequences. LSTM gates as learned gradient highways, and why they still could not fully solve what transformers addressed directly with attention and parallelism.

For about six years, from roughly 2012 to 2018, LSTMs were the engine of much of production NLP: machine translation at Google, speech recognition at Apple, autocomplete at Gmail. Then transformers appeared, and within about eighteen months most large-scale NLP systems had moved off LSTMs. Understanding what LSTMs solved, and why they still could not solve it completely, is the clearest explanation of what makes transformers different.

The recurrent architecture

An RNN reads a sequence one token at a time. At each step t, it combines the current input x_t with the hidden state h_{t-1} carried forward from the previous step: h_t = tanh(W_x·x_t + W_h·h_{t-1} + b). The hidden state is a fixed-dimensional vector (say, 256 numbers) that must compress all information from the past. After reading the whole sequence, the final hidden state encodes, in theory, the entire history. In practice, it encodes mostly the recent past.

The vanishing gradient problem

Backpropagation through time (BPTT) unrolls the RNN into a very deep network, one layer per timestep, and computes gradients by the chain rule. The gradient at step t is the gradient at step t+1 multiplied by the Jacobian of the hidden state transition. That Jacobian contains the same weight matrix W_h at every step, scaled by the tanh derivative, which is never larger than 1. If the largest singular value of W_h is below 1, gradients shrink exponentially toward zero as you go backward through time. If W_h is large enough to outweigh the tanh derivative, gradients can explode instead. Either way, little usable error signal from token 50 reaches token 1.
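
Written out, the chain of Jacobians looks roughly like the derivation below. It is a sketch under one simplifying assumption: the loss L depends only on the final hidden state h_T. The symbol D_t is shorthand introduced here for the diagonal matrix of tanh derivatives, and σ_max denotes the largest singular value.

```latex
\begin{aligned}
h_t &= \tanh(z_t), \qquad z_t = W_x x_t + W_h h_{t-1} + b \\
\frac{\partial h_t}{\partial h_{t-1}} &= D_t W_h, \qquad D_t = \operatorname{diag}\!\left(1 - h_t^{2}\right) \\
\frac{\partial L}{\partial h_k} &= \left(W_h^{\top} D_{k+1}\right)\left(W_h^{\top} D_{k+2}\right)\cdots\left(W_h^{\top} D_T\right)\frac{\partial L}{\partial h_T} \\
\left\lVert \frac{\partial L}{\partial h_k} \right\rVert &\le \sigma_{\max}(W_h)^{\,T-k} \left\lVert \frac{\partial L}{\partial h_T} \right\rVert
\end{aligned}
```

The last line uses the fact that every entry of D_t lies between 0 and 1. A largest singular value below 1 therefore guarantees geometric decay, while a value above 1 only permits growth and does not guarantee it.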

import numpy as np

def rnn_forward(xs, Wx, Wh, bh, seq_len=50):
    """Forward pass collecting hidden states and pre-activations."""
    d_h = Wh.shape[0]
    h = np.zeros(d_h)
    hs, pre_acts = [], []
    for x in xs[:seq_len]:
        z = Wx @ x + Wh @ h + bh
        pre_acts.append(z)
        h = np.tanh(z)
        hs.append(h.copy())
    return hs, pre_acts

np.random.seed(42)
d_in, d_h = 10, 32
Wx = np.random.randn(d_h, d_in) * 0.1
Wh = np.random.randn(d_h, d_h) * 0.1   # small weights → vanishing
bh = np.zeros(d_h)
xs = [np.random.randn(d_in) for _ in range(50)]

hs, pre_acts = rnn_forward(xs, Wx, Wh, bh)

# Simplified BPTT: follow only the gradient flowing through the hidden state.
# Start from dL/dh_T = ones, a stand-in upstream gradient at the last step.
grad = np.ones(d_h)
grad_norms = []
for t in reversed(range(len(hs))):
    tanh_grad = 1 - hs[t]**2          # tanh'(z_t), since hs[t] = tanh(z_t)
    grad = Wh.T @ (tanh_grad * grad)  # back through step t: now dL/dh_{t-1}
    grad_norms.append(np.linalg.norm(grad))

print("Gradient norm at each step (most-recent to oldest):")
for i, g in enumerate(grad_norms):
    bar = "█" * min(40, int(g * 200))
    print(f"  step {len(grad_norms)-i:3d}: {g:.6f} {bar}")

Run this and you will see the gradient norm shrink roughly geometrically as it moves backward: within about 20 steps of the end it has fallen by several orders of magnitude (the exact values depend on the random seed). The error signal from the final prediction barely changes the weights that processed the first tokens. This is why vanilla RNNs struggle to remember that the subject of a 50-word sentence was singular when trying to predict verb agreement at the end.
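
To isolate the role of the recurrent weights, here is a minimal sketch that drops the tanh and the inputs entirely and backpropagates through a purely linear recurrence. As an illustrative assumption, the recurrent matrix is a random orthogonal matrix scaled by a factor gamma, so every singular value equals gamma. The function name `linear_backprop_norms` and the parameter `gamma` are hypothetical and not part of the code above.

```python
import numpy as np

def linear_backprop_norms(gamma, d_h=32, steps=50, seed=0):
    """Gradient norms when backpropagating through h_t = Wh @ h_{t-1}.

    Wh is gamma times a random orthogonal matrix, so every singular
    value of Wh equals gamma (an illustrative choice, not a trained RNN).
    """
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((d_h, d_h)))  # Q is orthogonal
    Wh = gamma * Q
    grad = np.ones(d_h)               # stand-in upstream gradient
    norms = [np.linalg.norm(grad)]
    for _ in range(steps):
        grad = Wh.T @ grad            # one backward step, no tanh factor
        norms.append(np.linalg.norm(grad))
    return np.array(norms)

for gamma in (0.9, 1.0, 1.1):
    norms = linear_backprop_norms(gamma)
    ratio = norms[-1] / norms[0]
    print(f"gamma={gamma}: norm after 50 steps / initial norm = {ratio:.4g}")
```

In this idealised setting the norm is scaled by exactly gamma at every step, so after 50 steps it is multiplied by gamma to the 50th power: about 0.005 for gamma = 0.9 and about 117 for gamma = 1.1. A real RNN sits between these extremes because the tanh derivative adds a further factor of at most 1 at each step.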

LSTM: learned gating as gradient highways

The Long Short-Term Memory network (Hochreiter & Schmidhuber, 1997) adds a cell state c_t: a second, more slowly changing memory that flows forward with minimal transformation. In its now-standard form, three gates (input, forget, output) are sigmoid functions with outputs between 0 and 1 that learn when to write, erase, and read from the cell state. The key equation: c_t = f_t ⊙ c_{t-1} + i_t ⊙ g̃_t. The forget gate f_t controls how much of the old cell state to keep. The input gate i_t controls how much of the new candidate g̃_t to add. The output gate o_t controls how much of the cell state to expose as the hidden state: h_t = o_t ⊙ tanh(c_t).

import numpy as np

class LSTMCell:
    def __init__(self, d_in, d_h):
        # Concatenate all four gate weight matrices for efficiency
        self.W = np.random.randn(4 * d_h, d_in + d_h) * 0.1
        self.b = np.zeros(4 * d_h)
        self.d_h = d_h

    def forward(self, x, h_prev, c_prev):
        # Concatenate input and previous hidden state
        xh = np.concatenate([x, h_prev])   # (d_in + d_h,)
        gates = self.W @ xh + self.b        # (4*d_h,)

        # Split into four gates
        d = self.d_h
        i  = self._sigmoid(gates[0*d:1*d])   # input gate
        f  = self._sigmoid(gates[1*d:2*d])   # forget gate
        o  = self._sigmoid(gates[2*d:3*d])   # output gate
        g  = np.tanh(gates[3*d:4*d])         # cell gate (candidate)

        # Cell state update: forget old, write new
        c = f * c_prev + i * g               # ← gradient highway
        # Hidden state: read from cell state through output gate
        h = o * np.tanh(c)

        return h, c, {"i": i, "f": f, "o": o, "g": g}

    def _sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -20, 20)))

# Run a sequence through the LSTM cell
d_in, d_h, seq_len = 10, 32, 20
cell = LSTMCell(d_in, d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
for t in range(seq_len):
    x = np.random.randn(d_in)
    h, c, gates = cell.forward(x, h, c)
    if t % 5 == 0:
        print(f"Step {t:2d} | forget gate mean: {gates['f'].mean():.3f} | "
              f"cell norm: {np.linalg.norm(c):.3f}")

The gradient highway is the cell state update: c_t = f_t ⊙ c_{t-1} + i_t ⊙ g̃_t. Because the old cell state enters through an elementwise scale and an addition, not through a weight matrix and a squashing function, the gradient along the direct cell-state path is simply multiplied by f_t at each step. The forget gate is the key: when f_t is close to 1, the cell state is preserved almost perfectly and gradients flow back nearly unchanged; when it is well below 1, they still decay. The LSTM does not eliminate the vanishing gradient problem; it gives the network a learnable mechanism to route gradients around the problematic multiplicative path.
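
The effect of the forget gate on that direct path can be checked numerically. The sketch below makes a strong simplifying assumption: the gates are held fixed and treated as constants, which ignores their dependence on the hidden state, so only the cell-state path is measured. The helper names `run_cell_path`, `cell_path_gradient`, and `finite_difference` are hypothetical.

```python
import numpy as np

def run_cell_path(c0, forget_gates, writes):
    """Apply c_t = f_t * c_{t-1} + w_t, where w_t stands in for i_t * g_t."""
    c = c0
    for f, w in zip(forget_gates, writes):
        c = f * c + w
    return c

def cell_path_gradient(forget_gates):
    """d c_T / d c_0 along the direct path: the product of the forget gates."""
    return float(np.prod(forget_gates))

def finite_difference(forget_gates, writes, eps=1e-3):
    """Numerical check of d c_T / d c_0 around c_0 = 1."""
    hi = run_cell_path(1.0 + eps, forget_gates, writes)
    lo = run_cell_path(1.0 - eps, forget_gates, writes)
    return (hi - lo) / (2 * eps)

steps = 100
rng = np.random.default_rng(0)
writes = rng.standard_normal(steps) * 0.1   # arbitrary stand-in write values

for f_value in (0.99, 0.9):
    f = np.full(steps, f_value)
    print(f"f={f_value}: analytic={cell_path_gradient(f):.3e}, "
          f"finite-difference={finite_difference(f, writes):.3e}")
```

With the forget gate fixed at 0.99, about 37% of the gradient survives 100 steps; at 0.9 only a few parts in 100,000 survive. The highway works only as long as the network learns to keep the gate open.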

Why transformers replaced them

Two reasons, both fundamental. First, LSTMs are sequential by construction: you cannot compute h_t until you have h_{t-1}. This means you cannot parallelise across time steps during training. Transformers use self-attention, which computes all positions simultaneously. On modern GPUs with thousands of cores, this difference in parallelism translates directly into training speed and scale.

Second, the LSTM hidden state still has a fixed capacity bottleneck. Information from 100 steps ago must be compressed into the same 256-dimensional vector as information from 1 step ago. The forget gate helps, but what survives is a learned, lossy compression. Attention has no such bottleneck: every position can attend directly to every other position, with no intermediate state to squeeze through, at a cost in compute and memory that grows quadratically with sequence length. Long-range dependencies that LSTMs handle imperfectly, transformers handle natively.
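
Both points show up in even a minimal self-attention sketch. The version below is a single head with random, untrained projection matrices and no causal mask or positional encoding, all of which are simplifying assumptions; the names `self_attention`, `Wq`, `Wk`, and `Wv` are illustrative. Note that there is no loop over time steps: all positions are handled by a few matrix products.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over all positions at once."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # each (T, d)
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # (T, T): every pair of positions
    scores -= scores.max(axis=-1, keepdims=True)  # stabilise the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
T, d = 100, 16                                    # sequence length, model width
X = rng.standard_normal((T, d))                   # stand-in token embeddings
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

out, weights = self_attention(X, Wq, Wk, Wv)
print(out.shape, weights.shape)                   # (100, 16) (100, 100)
print("weight the last position places on the first:", weights[-1, 0])
```

The (T, T) weight matrix is the direct connection between every pair of positions: the last token reads the first token's value vector in one step instead of through 99 intermediate hidden states. It is also where the quadratic cost in sequence length comes from.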

LSTMs are not obsolete — they run efficiently on embedded hardware, train fast on short sequences, and still appear in production systems where GPU scale is not available. But for any task where long context, parallelism, or scale matters, transformers are the right architecture. Knowing exactly why is what makes you useful in a system design interview.
