Encoder-Decoder Architecture: T5, BART, Cross-Attention, and When to Use It vs Decoder-Only

T5's text-to-text unification, BART's denoising pretraining, cross-attention mechanics, and the practical decision: encoder-decoder for seq2seq, decoder-only for open generation, encoder-only for embedding.

Encoder-Decoder: The Architecture for Sequence-to-Sequence Tasks

GPT-4, Claude, Llama — these are decoder-only models. BERT, RoBERTa — encoder-only. T5, BART, mBART — encoder-decoder. The distinction determines what a model is built for. Encoder-decoder architectures are purpose-built for tasks that require processing one sequence and generating another: translation, summarization, abstractive QA. Understanding when to use each saves you from architecture mismatch.

Architecture Anatomy

The encoder processes the entire input sequence bidirectionally, building contextual representations for each token. The decoder generates output tokens autoregressively — one at a time, left to right, attending to previously generated tokens (self-attention) AND to the full encoder output (cross-attention). It's this cross-attention that allows the decoder to 'read' the input while generating the output.

# Simplified encoder-decoder forward pass
# (pseudocode: encoder, decoder, and sample stand in for real model components)

# Encoder: full bidirectional attention over input, run once per input
encoder_out = encoder(input_ids)  # [src_len, d_model], contextualized input reps

# Decoder: autoregressively generate output, starting from a start-of-sequence token
output_so_far = [BOS_TOKEN]
for step in range(max_output_len):
    # Self-attention: causal (attend only to previous output tokens)
    # Cross-attention: attends to ALL encoder_out positions
    output_logits = decoder(output_so_far, encoder_out)  # [len(output_so_far), vocab_size]
    next_token = sample(output_logits[-1])  # logits at the last position predict the next token
    output_so_far.append(next_token)
    if next_token == EOS_TOKEN:
        break
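
The difference between the three attention patterns described above can be made concrete with visibility masks. The following is a minimal sketch in plain Python; the helper names (`causal_mask`, `full_mask`, `render`) are made up for this illustration and do not come from any library.

```python
def causal_mask(tgt_len):
    """Decoder self-attention: output position i may see positions 0..i only."""
    return [[j <= i for j in range(tgt_len)] for i in range(tgt_len)]


def full_mask(rows, cols):
    """Encoder self-attention and decoder cross-attention: every position is visible."""
    return [[True] * cols for _ in range(rows)]


def render(mask):
    """Draw a mask as text: 'x' = may attend, '.' = blocked."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


src_len, tgt_len = 4, 3
encoder_self = full_mask(src_len, src_len)    # input token <-> every input token
decoder_self = causal_mask(tgt_len)           # output token -> earlier output tokens
decoder_cross = full_mask(tgt_len, src_len)   # output token -> every input token

print(render(decoder_self))
```

Only the decoder's self-attention is triangular. Its cross-attention sees the whole input at every step, which is why the encoder needs to run only once per input.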

T5: Everything Is Text-to-Text

T5 (Text-To-Text Transfer Transformer) unifies all NLP tasks under a single text-to-text format. Classification becomes 'classify: [sentence] → positive'. Summarization: 'summarize: [document] → [summary]'. Translation: 'translate English to French: [text] → [translation]'. This framing lets T5 be pretrained on a massive multi-task mixture using the same model and loss function for every task.
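
As a rough sketch of what this unification looks like in data preparation, the snippet below turns three different tasks into plain (input text, target text) pairs. The prefixes are the ones quoted in the paragraph above; the `to_text_to_text` helper and the example sentences are assumptions made for illustration, and `classify:` in particular is this article's illustrative prefix rather than one tied to any specific checkpoint.

```python
# Task prefixes as quoted in the text above (illustrative, not checkpoint-specific).
TASK_PREFIXES = {
    "classify": "classify: ",
    "summarize": "summarize: ",
    "translate_en_fr": "translate English to French: ",
}


def to_text_to_text(task, source, target):
    """Return an (input_text, target_text) pair for one training example."""
    if task not in TASK_PREFIXES:
        raise KeyError(f"unknown task: {task}")
    return TASK_PREFIXES[task] + source, target


examples = [
    to_text_to_text("classify", "The film was a delight.", "positive"),
    to_text_to_text("summarize", "Long document text ...", "Short summary."),
    to_text_to_text("translate_en_fr", "Good morning.", "Bonjour."),
]

for input_text, target_text in examples:
    print(repr(input_text), "->", repr(target_text))
```

Every pair has the same shape, so a single sequence-to-sequence loss over target tokens can cover all three tasks without task-specific output heads.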

T5 uses relative position biases rather than absolute position embeddings, which helps it handle varying input lengths. T5-base has about 220M parameters, T5-large about 770M, and the largest variant, T5-11B, about 11B. In practice, Flan-T5 (T5 fine-tuned on instruction data) significantly outperforms the original T5 on zero-shot and few-shot tasks and is a common default when you want an encoder-decoder model without full LLM scale.
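
To see why relative positions cope with varying lengths, consider a deliberately simplified sketch. It assumes the attention bias depends only on the clipped offset between key and query positions. Real T5 is more involved (it groups offsets into buckets and learns a scalar per bucket and per head), and the `bias_table` values here are invented placeholders.

```python
def clipped_relative_offsets(query_len, key_len, max_distance):
    """Offset (key index - query index) for every pair, clipped to +/- max_distance."""
    return [
        [max(-max_distance, min(max_distance, j - i)) for j in range(key_len)]
        for i in range(query_len)
    ]


# Hypothetical bias table: one scalar per clipped offset (learned in a real model).
max_distance = 2
bias_table = {offset: 0.1 * offset for offset in range(-max_distance, max_distance + 1)}

short = clipped_relative_offsets(3, 3, max_distance)
long = clipped_relative_offsets(6, 6, max_distance)

# The bias added to the attention scores is looked up from the same small table.
bias_long = [[bias_table[o] for o in row] for row in long]

print(short)
```

The same five-entry table serves a 3-token and a 6-token sequence, because nothing in it is indexed by absolute position.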

BART: Denoising Autoencoder

BART pretrains by corrupting text with multiple noise functions — token masking, token deletion, sentence permutation, text infilling — then training the encoder-decoder to reconstruct the original. This makes BART particularly good at generative tasks where you're transforming corrupted or compressed text: summarization, dialogue, text style transfer.
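
The four noise functions named above can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not BART's actual preprocessing: tokens are whitespace-split words, the `[MASK]` string is a placeholder, the corruption probabilities are arbitrary, and text infilling replaces a single span of a fixed, caller-chosen length.

```python
import random

MASK = "[MASK]"  # placeholder mask symbol for this sketch


def token_masking(tokens, rng, p=0.3):
    """Replace each token with MASK independently with probability p."""
    return [MASK if rng.random() < p else t for t in tokens]


def token_deletion(tokens, rng, p=0.3):
    """Drop each token with probability p; the model must infer where tokens are missing."""
    return [t for t in tokens if rng.random() >= p]


def text_infilling(tokens, rng, span_len):
    """Replace one randomly placed span of span_len tokens with a single MASK."""
    start = rng.randrange(0, len(tokens) - span_len + 1)
    return tokens[:start] + [MASK] + tokens[start + span_len:]


def sentence_permutation(sentences, rng):
    """Shuffle sentence order; each sentence is a list of tokens."""
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    return shuffled


rng = random.Random(0)
tokens = "the encoder reads the whole input before decoding starts".split()

masked = token_masking(tokens, rng)
deleted = token_deletion(tokens, rng)
infilled = text_infilling(tokens, rng, span_len=3)
permuted = sentence_permutation([["a", "b"], ["c"], ["d", "e"]], rng)

# A training pair is (corrupted text for the encoder, original text as decoder target).
training_pair = (infilled, tokens)
```

Text infilling hides how many tokens are missing, since a whole span collapses into one mask, whereas token masking preserves length.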

BART's encoder is a full BERT-like bidirectional transformer. BART's decoder is a GPT-like autoregressive transformer. The cross-attention sublayer in each decoder layer attends to the output of the final encoder layer. Fine-tuned on CNN/DailyMail, BART achieved state-of-the-art summarization ROUGE scores at the time of its release. mBART extends this approach to multilingual settings.

Cross-Attention: The Key Mechanism

In each decoder layer, the cross-attention sublayer computes queries from the current decoder state, but keys and values from the encoder output. This is what allows the decoder to 'focus on' different parts of the input at each generation step. For translation, cross-attention learns to align source and target tokens. For summarization, it learns to attend to salient passages.

# Cross-attention in the decoder (single head, for clarity)
# Q comes from decoder hidden states
# K, V come from encoder output (fixed for all decoder steps)
Q = decoder_state @ W_q    # [dec_seq, d_k]
K = encoder_out  @ W_k    # [enc_seq, d_k]
V = encoder_out  @ W_v    # [enc_seq, d_v]

# softmax is taken over the encoder axis, so each row sums to 1
attention_weights = softmax(Q @ K.T / sqrt(d_k))  # [dec_seq, enc_seq]
output = attention_weights @ V                     # [dec_seq, d_v]

# attention_weights tells you: 'when generating this output token,
# which input tokens is the model focusing on?'
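
For readers who want to run the computation, here is a minimal single-head version in plain Python with no dependencies. It is a sketch rather than a production implementation: the helper functions are written out by hand, the projection matrices are identity matrices chosen so the result is easy to check, and real models use multiple heads and learned weights.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(decoder_state, encoder_out, w_q, w_k, w_v):
    """Queries from the decoder, keys and values from the encoder."""
    q = matmul(decoder_state, w_q)   # [dec_seq, d_k]
    k = matmul(encoder_out, w_k)     # [enc_seq, d_k]
    v = matmul(encoder_out, w_v)     # [enc_seq, d_v]
    d_k = len(q[0])
    scores = matmul(q, transpose(k))  # [dec_seq, enc_seq]
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


identity = [[1.0, 0.0], [0.0, 1.0]]              # toy projections, not learned
decoder_state = [[1.0, 0.0], [0.0, 1.0]]          # 2 output positions
encoder_out = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # 3 input positions

output, weights = cross_attention(decoder_state, encoder_out, identity, identity, identity)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` is a distribution over the three input positions. In this toy setup the first output position puts most of its weight on the first input token and the second on the second. Because `k` and `v` depend only on `encoder_out`, an implementation can compute them once and reuse them at every decoding step.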

Encoder-Decoder vs Decoder-Only: The Practical Decision

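The trend is toward decoder-only dominance. At large scales (roughly above 10B parameters), decoder-only models tend to match or exceed encoder-decoder models on most seq2seq benchmarks while being architecturally simpler, so the encoder-decoder advantage shrinks as scale increases. Encoder-decoder remains a reasonable choice for well-defined seq2seq tasks such as translation and summarization when you want a model well below full LLM scale. For most new production systems, use a decoder-only LLM for generation tasks and an encoder-only model for embedding tasks.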
