
Seq2Seq and Bahdanau Attention: The Attention Mechanism Before Transformers

The encoder-decoder bottleneck problem, Bahdanau's fix (query all encoder states at every decoder step), and the direct line from additive attention to transformer dot-product attention. NumPy implementation of the alignment mechanism.

In 2014, two papers were published that changed how machines translate language. The first — Sutskever, Vinyals, Le — showed that an encoder LSTM could compress a sentence into a vector, and a decoder LSTM could expand that vector into a translation. The second — Bahdanau, Cho, Bengio — added a mechanism for the decoder to look back at all encoder states instead of a single compressed vector. That mechanism was called attention. It is the direct ancestor of the self-attention in transformers.

The bottleneck problem in encoder-decoder

The encoder reads the source sentence token by token and produces a final hidden state: a fixed-size vector that must encode everything the decoder will need to produce the translation. For short sentences this works. For sentences of 30+ words, the encoder still has to squeeze an arbitrary amount of information into the same fixed-size vector (512 dimensions, say), and translation quality degrades as sentences get longer. This is the bottleneck problem.
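To make the bottleneck concrete, here is a minimal sketch of a toy tanh RNN encoder. The `encode` function, the random weights, and the sizes are illustrative assumptions rather than the architecture from either paper. Whatever the sentence length, the final state has the same shape.

```python
import numpy as np

def encode(tokens, W_x, W_h):
    """Toy tanh RNN encoder (illustrative): returns all hidden states and the final one."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x in tokens:                        # one recurrent step per source token
        h = np.tanh(W_x @ x + W_h @ h)
        states.append(h)
    return np.stack(states), h

rng = np.random.default_rng(0)
d_in, d_h = 16, 512                         # assumed sizes, chosen for illustration
W_x = rng.normal(scale=0.1, size=(d_h, d_in))
W_h = rng.normal(scale=0.1, size=(d_h, d_h))

short_sentence = rng.normal(size=(5, d_in))   # 5 random stand-in token embeddings
long_sentence = rng.normal(size=(50, d_in))   # 50 random stand-in token embeddings

_, final_short = encode(short_sentence, W_x, W_h)
all_long, final_long = encode(long_sentence, W_x, W_h)

print(final_short.shape, final_long.shape)  # (512,) (512,)
print(all_long.shape)                       # (50, 512)
```

A plain seq2seq decoder sees only `final_long`. The 50 per-token states in `all_long` are computed anyway and then thrown away.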

Bahdanau attention: query the encoder at every decoder step

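Bahdanau's insight: instead of compressing the source into a single vector, keep all encoder hidden states h_1, h_2, ..., h_T and let the decoder query them at every step. At each decoder step t, compute a score e_i between the current decoder state s_t and each encoder state h_i. Normalise the scores with softmax to get attention weights α_i, which are positive and sum to 1. Compute a context vector c = Σ_i α_i·h_i. Combine c with the decoder state (for example by concatenation) and use the result to generate the next output token. In the original paper the score is computed from the previous decoder state s_{t-1} and the context vector feeds into the decoder's recurrent update; the simplified form used here keeps the same idea.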

import numpy as np

def bahdanau_attention(encoder_states, decoder_state, W_a, U_a, v_a):
    """
    encoder_states: (T, d_h) — all encoder hidden states
    decoder_state:  (d_h,)   — current decoder hidden state
    W_a, U_a:       (d_a, d_h) — alignment model weights
    v_a:            (d_a,)    — alignment model vector
    """

    # Energy function: e_i = v_a · tanh(W_a·s + U_a·h_i)
    # Broadcast decoder projection across all encoder states
    decoder_proj  = (W_a @ decoder_state)              # (d_a,)
    encoder_proj  = (U_a @ encoder_states.T).T        # (T, d_a)
    combined      = np.tanh(encoder_proj + decoder_proj)  # (T, d_a)
    energies      = combined @ v_a                     # (T,)

    # Softmax to get attention weights
    e_max    = energies.max()
    exp_e    = np.exp(energies - e_max)
    weights  = exp_e / exp_e.sum()                     # (T,)

    # Context vector: weighted sum of encoder states
    context  = (weights[:, None] * encoder_states).sum(axis=0)  # (d_h,)

    return weights, context


# ── Demo ─────────────────────────────────────────────────────────────────────
np.random.seed(42)
T_src, d_h, d_a = 8, 64, 32

# Simulate encoder reading "The quick brown fox jumped over the fence"
encoder_states = np.random.randn(T_src, d_h) * 0.5
decoder_state  = np.random.randn(d_h) * 0.5

# Alignment model parameters
W_a = np.random.randn(d_a, d_h) * 0.1
U_a = np.random.randn(d_a, d_h) * 0.1
v_a = np.random.randn(d_a) * 0.1

weights, context = bahdanau_attention(encoder_states, decoder_state, W_a, U_a, v_a)

src_tokens = ["The", "quick", "brown", "fox", "jumped", "over", "the", "fence"]
print("Attention weights at this decoder step:")
for token, weight in zip(src_tokens, weights):
    bar = "█" * int(weight * 80)
    print(f"  {token:8s}: {weight:.3f} {bar}")
print(f"\nContext vector shape: {context.shape}, norm: {np.linalg.norm(context):.3f}")

The connection to transformer self-attention

Bahdanau attention computes: energy(s_t, h_i) = v·tanh(W·s_t + U·h_i). Transformer attention computes: score(Q, K) = Q·Kᵀ / sqrt(d_k). Different parameterisation, same conceptual operation: compare the current query (decoder state / Q vector) against all available keys (encoder states / K vectors), then produce a weighted sum of values (encoder states / V vectors). The critical difference is that transformer attention is multiplicative (dot product) rather than additive (a small MLP), so all scores come out of a single matrix multiplication, which is faster and easier to parallelise. The scaling by sqrt(d_k) keeps the dot products from growing with the vector dimension; without it, large scores would saturate the softmax and leave it with very small gradients.
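The dot-product form can be sketched in a few lines of NumPy. This is a minimal illustration, not a full transformer layer: it omits the learned Q/K/V projections and multiple heads, and the helper names and sizes are assumptions. The second half checks empirically why the sqrt(d_k) scaling matters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> weights (n_q, n_k), output (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # all query-key scores in one matmul
    weights = softmax(scores, axis=-1)
    return weights, weights @ V

rng = np.random.default_rng(0)
T_src, d_k = 8, 64                            # assumed sizes
encoder_states = rng.normal(size=(T_src, d_k))
decoder_state = rng.normal(size=(1, d_k))

# Bahdanau's role assignment: decoder state is the query,
# encoder states serve as both keys and values.
weights, context = scaled_dot_product_attention(decoder_state, encoder_states, encoder_states)
print(weights.shape, context.shape)           # (1, 8) (1, 64)

# Why scale: dot products of unit-variance vectors have std close to sqrt(d_k).
q = rng.normal(size=(2000, 512))
k = rng.normal(size=(2000, 512))
raw = (q * k).sum(axis=1)
scaled = raw / np.sqrt(512)
print(raw.std(), scaled.std())                # close to sqrt(512) ≈ 22.6, and close to 1
```

Scores with a standard deviation above 20 would push nearly all softmax mass onto one key. Dividing by sqrt(d_k) brings them back to roughly unit scale.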

In a transformer self-attention layer, every token attends to every token in the sequence at once (in a decoder, a causal mask restricts this to earlier positions). In Bahdanau attention, only the decoder queries the encoder, and it does so sequentially, one decoder step at a time. The 'self' in self-attention means the same sequence supplies the queries, keys, and values: every token queries every token in its own sequence. This is the generalisation that made transformers so powerful: you can model relationships within a sequence without any recurrence at all.
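As a rough sketch of the "same sequence queries itself" idea, the snippet below derives queries, keys, and values from one input matrix `X`. The projection matrices are random stand-ins for learned weights, it uses a single head, and all names and sizes are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (n, d_model). Queries, keys and values are all projections of the same X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (n, n): every token against every token
    weights = softmax(scores, axis=-1)
    return weights, weights @ V

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 8, 64, 32            # assumed sizes
X = rng.normal(size=(n_tokens, d_model)) * 0.5
W_q = rng.normal(size=(d_model, d_k)) * 0.1   # random stand-ins for learned weights
W_k = rng.normal(size=(d_model, d_k)) * 0.1
W_v = rng.normal(size=(d_model, d_k)) * 0.1

weights, outputs = self_attention(X, W_q, W_k, W_v)
print(weights.shape)                          # (8, 8)
print(outputs.shape)                          # (8, 32)
```

There is no loop over positions: the whole (8, 8) weight matrix comes from one matrix product, whereas the Bahdanau decoder produces one row of weights per decoding step.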

What this means for encoder-decoder architectures today

T5, BART, and mT5 are still encoder-decoder models. The encoder reads the full source and produces bidirectional representations. The decoder attends to encoder representations via cross-attention (which is Bahdanau attention, upgraded to transformer form) and generates autoregressively. Tasks that are natural fits: translation, summarisation, question answering where the source and target are different sequences.

GPT is decoder-only: no separate encoder, just a single sequence processed with causal masking. BERT is encoder-only: bidirectional processing, no generation. The architectural choice is not arbitrary — it determines what the model can do. Encoder-only models are efficient for classification and retrieval (embed a full sentence bidirectionally). Decoder-only models are efficient for generation (each token conditions on all previous tokens). Encoder-decoder models are natural for sequence-to-sequence tasks. Knowing where Bahdanau attention sits in this history tells you why these three families exist.
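One way to see the three families side by side is through which positions are allowed to attend to which. The sketch below is a simplification under stated assumptions: raw vectors stand in for learned projections, and the `attend` helper and its boolean `mask` argument are hypothetical names, not any library's API.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values, mask=None):
    """Dot-product attention; mask[i, j] == True means query i may look at key j."""
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)   # blocked positions get zero weight
    weights = softmax(scores, axis=-1)
    return weights, weights @ values

rng = np.random.default_rng(0)
n_src, n_tgt, d = 5, 4, 16                    # assumed sizes
src = rng.normal(size=(n_src, d))             # stand-in source token vectors
tgt = rng.normal(size=(n_tgt, d))             # stand-in target token vectors

# Encoder-only style (BERT-like): bidirectional, no mask.
w_enc, _ = attend(src, src, src)

# Decoder-only style (GPT-like): causal mask, each token sees itself and the past.
causal = np.tril(np.ones((n_tgt, n_tgt), dtype=bool))
w_dec, _ = attend(tgt, tgt, tgt, mask=causal)

# Encoder-decoder cross-attention: target queries, source keys and values.
w_cross, _ = attend(tgt, src, src)

print(w_enc.shape, w_dec.shape, w_cross.shape)  # (5, 5) (4, 4) (4, 5)
print(np.round(w_dec, 2))                       # upper triangle is all zeros
```

The cross-attention call is the Bahdanau pattern in dot-product form: the decoder side asks the questions and the encoder side supplies the keys and values.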
