BERT vs GPT: Why the Architecture Determines What You Can Build
One architectural choice — causal masking — separates encoder from decoder models. MLM vs CLM training objectives, what each learns, which tasks each family wins, and why decoder-only models won at scale. The framework that makes every model deep-dive legible.
BERT and GPT are both transformer models. They share the same core attention mechanism, the same positional encodings, the same feedforward sublayers. They differ in one architectural choice — masking — that determines everything else: what they can learn, what tasks they are good at, and how you use them in production.
The core architectural difference
BERT is an encoder: it reads the full sequence bidirectionally. Every token can attend to every other token. If you feed BERT 'The cat sat on the mat', the representation of 'cat' is computed from all six tokens at once, including the five on either side of it: it can look left and right. This produces rich contextual representations, but the model cannot generate text left to right, because generation requires attending only to the past and BERT's representations already depend on tokens that would not exist yet.
GPT is a decoder: it reads the sequence with causal masking. Each token can only attend to previous tokens. The representation of 'cat' in position 2 can see 'The' and 'cat' but not 'sat on the mat'. This means GPT can be used autoregressively: generate token 1, feed it back in, generate token 2, and so on. The masking is the generation mechanism.
import numpy as np

def show_attention_mask(tokens, causal=False):
    """Print which positions each token is allowed to attend to."""
    n = len(tokens)
    if causal:
        # Lower triangular: token i can see tokens 0..i
        mask = np.tril(np.ones((n, n), dtype=int))
    else:
        # Full: every token sees every other token
        mask = np.ones((n, n), dtype=int)
    title = "Causal (GPT-style)" if causal else "Bidirectional (BERT-style)"
    print(f"\n{title} attention mask:")
    # Column header, indented past the 8-character row label
    header = " " * 9 + " ".join(f"{t[:5]:5s}" for t in tokens)
    print(header)
    for i, row_token in enumerate(tokens):
        row = " ".join(f"{'✓' if mask[i, j] else '✗':^5s}" for j in range(n))
        print(f"{row_token[:8]:8s} {row}")

tokens = ["[CLS]", "The", "cat", "sat", "on", "[SEP]"]
show_attention_mask(tokens, causal=False)  # BERT
show_attention_mask(tokens, causal=True)   # GPT
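The mask only matters once it is applied to attention scores. The following is a minimal sketch of that step, assuming a square matrix of raw scores (a random stand-in for the usual scaled query-key product) and a hypothetical helper name `attention_weights`. Future positions are set to negative infinity before the softmax, so they receive exactly zero weight.

```python
import numpy as np

def attention_weights(scores, causal=False):
    """Row-wise softmax over a square score matrix, optionally causally masked."""
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    if causal:
        # Positions j > i are in the future: set them to -inf before the softmax
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    # Subtract the row maximum for numerical stability (the diagonal is never masked)
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))  # stand-in for Q @ K.T / sqrt(d)
bidirectional = attention_weights(scores, causal=False)
causal = attention_weights(scores, causal=True)
print(np.round(bidirectional, 2))
print(np.round(causal, 2))
```

In the causal output, the first row puts all of its weight on the first token and everything above the diagonal is zero, while each row still sums to one.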
Training objectives and what they learn
BERT is trained with Masked Language Modeling (MLM): select about 15% of tokens at random, hide most of them behind a [MASK] token, and train the model to predict the originals using both left and right context. Because the model sees the full context, it learns deep bidirectional representations. The task forces it to understand each word in relation to everything around it. BERT was also trained with Next Sentence Prediction (NSP), which later work such as RoBERTa found could be dropped without hurting downstream performance; MLM is what drives the representation quality.
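The corruption step can be sketched in a few lines. This is an illustrative version, not BERT's actual data pipeline: it samples each position independently with probability 0.15 (a simplification) and follows the 80/10/10 rule described in the BERT paper, where a selected token becomes [MASK] 80% of the time, a random token 10% of the time, and stays unchanged 10% of the time. The function name `mlm_corrupt` and the use of `None` for "no loss at this position" are assumptions made for the example.

```python
import random

MASK = "[MASK]"

def mlm_corrupt(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted inputs, labels); labels are None where no loss is computed."""
    rng = random.Random(seed)
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # not selected: this position contributes nothing to the loss
        labels[i] = tok  # the model must recover the original token here
        r = rng.random()
        if r < 0.8:
            inputs[i] = MASK               # 80%: replace with the mask token
        elif r < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return inputs, labels

tokens = "the cat sat on the mat and looked at the dog".split()
vocab = sorted(set(tokens))
inputs, labels = mlm_corrupt(tokens, vocab, mask_prob=0.15, seed=1)
print(inputs)
print(labels)
```

Only the positions with a non-`None` label are prediction targets, which is the property the scaling argument later in the article relies on.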
GPT is trained with Causal Language Modeling (CLM): predict the next token given all previous tokens. This is just predicting the next word, repeated billions of times. The model learns whatever is predictable from context: grammar, facts, reasoning patterns, style. Scale this to billions of parameters and hundreds of billions of tokens, and you get a model that handles a wide range of tasks without task-specific training. The generation objective is the reason GPT-style models dominate production: the same training that teaches prediction also teaches everything else.
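In code, the CLM objective amounts to shifting the sequence by one position and averaging a cross-entropy loss over every position. The sketch below uses made-up token ids and all-zero logits (a model that knows nothing), so the helper names `clm_pairs` and `cross_entropy` and the vocabulary size are illustrative assumptions rather than any library's API.

```python
import numpy as np

def clm_pairs(token_ids):
    """Inputs are tokens 0..n-2; targets are the same sequence shifted left by one."""
    return token_ids[:-1], token_ids[1:]

def cross_entropy(logits, targets):
    """Mean negative log-likelihood of the targets under softmax(logits)."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

token_ids = [3, 1, 4, 1, 5]            # hypothetical token ids
inputs, targets = clm_pairs(token_ids)
vocab_size = 6
uniform_logits = np.zeros((len(inputs), vocab_size))  # uninformed model
loss = cross_entropy(uniform_logits, targets)
print(inputs, targets, loss)
```

With uniform predictions the loss equals the natural log of the vocabulary size, which is the baseline training starts from; every one of the four positions contributes to it.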
T5 uses a span corruption objective: randomly mask spans of text (not individual tokens), replace each span with a single sentinel token, and train the decoder to produce the original spans while the encoder reads the corrupted input. This encoder-decoder training makes T5 excellent at tasks framed as text-to-text: translate X to Y, summarise X, answer question X using context Y.
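A small sketch makes the input and target formats concrete. The spans here are picked by hand rather than sampled, and the sentinel naming `<extra_id_N>` follows the convention used by the Hugging Face T5 tokenizer; both are assumptions made for illustration, as is the helper name `span_corrupt`.

```python
def span_corrupt(tokens, spans):
    """Build (encoder input, decoder target) for span corruption.

    spans: sorted, non-overlapping (start, end) index pairs, end exclusive.
    """
    inputs, targets = [], []
    cursor = 0
    for k, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        inputs.extend(tokens[cursor:start])  # keep the text before the span
        inputs.append(sentinel)              # the whole span collapses to one sentinel
        targets.append(sentinel)             # the target names the sentinel...
        targets.extend(tokens[start:end])    # ...then spells out the dropped tokens
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel ends the target
    return inputs, targets

tokens = "Thank you for inviting me to your party last week .".split()
inputs, targets = span_corrupt(tokens, spans=[(2, 4), (8, 9)])
print(" ".join(inputs))
print(" ".join(targets))
```

The encoder sees the sentence with two gaps, and the decoder is trained to emit only the missing pieces, which keeps targets short compared with reconstructing the full input.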
What each architecture is good at
Encoder-only (BERT, RoBERTa, DeBERTa): classification tasks (sentiment, intent detection, NLI), named entity recognition (sequence labelling), retrieval/embeddings (the bidirectional representation is richer for capturing full sentence meaning), extractive question answering (highlight the answer span in a document).
Decoder-only (GPT-2, GPT-3/4, LLaMA, Mistral, Claude, Gemini): generation tasks (summarisation, translation, code completion, instruction following, chat), few-shot and zero-shot tasks (a model trained on next-token prediction learns to continue patterns shown in the prompt), system design and reasoning.
Encoder-decoder (T5, BART, mT5): translation, abstractive summarisation, question answering with context, text editing, any task where the input and output sequences are different in structure. The encoder produces a rich bidirectional representation; the decoder generates autoregressively from it.
Why decoder-only models won at scale
Three reasons. First, the CLM objective is simpler and scales better: every token in every training sequence is a prediction target, so there is no wasted computation. MLM masks only 15% of tokens; the other 85% serve as context but contribute nothing to the loss. Second, decoder-only architectures generalise naturally to in-context learning and instruction following. A decoder can read instructions, examples, and a query all as a flat sequence and produce an answer. Third, scaling laws such as Chinchilla were measured on exactly this setup: CLM loss falls predictably as compute, parameters, and data grow, and lower loss has tracked better performance across tasks. BERT-style models have seen far less investment at that scale, and bidirectional pre-training appears harder to scale without architectural changes.
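The first reason is simple arithmetic, sketched below for a 512-token sequence. The function name `loss_positions` and the sequence length are illustrative assumptions; the 15% rate is the one quoted above.

```python
def loss_positions(seq_len, objective, mask_prob=0.15):
    """Approximate number of positions that contribute to the loss per sequence."""
    if objective == "clm":
        return seq_len - 1  # every token except the first is a prediction target
    if objective == "mlm":
        return round(seq_len * mask_prob)  # only the selected tokens are targets
    raise ValueError(f"unknown objective: {objective}")

seq_len = 512
clm = loss_positions(seq_len, "clm")
mlm = loss_positions(seq_len, "mlm")
print(clm, mlm, clm / mlm)
```

For the same forward pass, CLM gets several times more supervised positions than MLM. This counts loss terms only; it does not by itself say how informative each term is.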
The result: GPT-3 showed that a large enough decoder-only model, with zero task-specific fine-tuning, could match or outperform fine-tuned BERT on some benchmarks. Modern frontier models (Claude, GPT-4, Gemini, Mistral, LLaMA) have converged on decoder-only designs. Encoder models still win for pure retrieval (embedding a sentence bidirectionally is better than embedding it causally), but for open-ended generation, the decoder won.
Implement both: run BERT-base and GPT-2-small on the same classification task using HuggingFace. Fine-tune BERT for 3 epochs (it converges fast and needs relatively little data). Use GPT-2 zero-shot with a carefully formatted prompt. Compare on 100 held-out examples. You will likely find that BERT wins once you have even a few hundred labelled examples (<1k), while GPT-2 is the only one of the two that can make predictions with 0 labelled examples, and how well it does depends heavily on the prompt. This is the fundamental trade-off between supervised fine-tuning and in-context learning.
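One common way to run the zero-shot side of this comparison is to score each candidate label as a continuation of the prompt and pick the most likely one. The sketch below shows only that scaffolding. `log_prob_fn` is a hypothetical callable that should return the model's log-probability of a continuation given a prompt; `toy_log_prob` is a keyword-counting stand-in so the example runs without downloading a model, and its scores mean nothing. The prompt template and example reviews are also made up.

```python
def zero_shot_classify(text, labels, log_prob_fn,
                       template="Review: {text}\nSentiment:"):
    """Pick the label whose continuation the scorer rates most likely."""
    prompt = template.format(text=text)
    scores = {label: log_prob_fn(prompt, " " + label) for label in labels}
    return max(scores, key=scores.get), scores

def toy_log_prob(prompt, continuation):
    """Hypothetical stand-in for log P(continuation | prompt) from a real model."""
    positive_cues = {"great", "loved", "excellent"}
    negative_cues = {"boring", "awful", "hated"}
    words = set(prompt.lower().replace(".", " ").replace(",", " ").split())
    pos = len(words & positive_cues)
    neg = len(words & negative_cues)
    if continuation.strip() == "positive":
        return float(pos - neg)
    return float(neg - pos)

def accuracy(examples, labels, log_prob_fn):
    """Fraction of (text, gold label) pairs classified correctly."""
    correct = sum(
        zero_shot_classify(text, labels, log_prob_fn)[0] == gold
        for text, gold in examples
    )
    return correct / len(examples)

held_out = [
    ("I loved this film, great cast.", "positive"),
    ("Boring plot and awful pacing.", "negative"),
]
print(accuracy(held_out, ["positive", "negative"], toy_log_prob))
```

To use this with GPT-2, replace `toy_log_prob` with a function that sums the model's token log-probabilities over the continuation; the rest of the evaluation loop stays the same, and the same `accuracy` helper can score the fine-tuned BERT predictions for a like-for-like comparison.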
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Devlin et al. (2019)
- Language Models are Few-Shot Learners (GPT-3), Brown et al. (2020)
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5), Raffel et al. (2020)