
BERT Internals: MLM, WordPiece, [CLS] Token, and Why It Fails for Semantic Similarity

BERT's masked language modeling and NSP pretraining, WordPiece tokenization, and the [CLS] pooling trap: why [CLS] vectors from vanilla BERT give unreliable sentence-similarity scores, and what SBERT does differently.

**Prerequisite: Steps 2–3 (Attention + MHA).** After this post, you'll understand what makes BERT different from GPT, how masked language modeling pretrains an encoder, and why encoders are the right architecture for retrieval, classification, and embeddings.

What BERT Actually Does (And Why It Broke Everything)

BERT — Bidirectional Encoder Representations from Transformers — is the model that ended the era of task-specific architectures. Before BERT, an NLP engineer built a different model for NER, a different one for classification, a different one for QA. BERT showed you could pretrain one deep transformer encoder on massive unlabeled text, then fine-tune it on any task with a small labeled dataset and beat specialized architectures. That's the core insight. Everything else is implementation.

The Architecture: Encoder-Only

BERT is a stack of transformer encoder blocks. The 'encoder-only' distinction matters: unlike GPT, which attends only to previous tokens (causal/autoregressive), BERT lets every token attend to every other token in the sequence, in both directions. This bidirectional attention is what makes BERT powerful for understanding tasks: a token's representation is conditioned on everything around it, not just what came before.
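To make the difference concrete, here is a minimal sketch of the two attention masks in plain Python. The function names are illustrative and not taken from any library; a 1 at row i, column j means position i may attend to position j.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position can attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: position i can attend only to positions j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


tokens = ["the", "bank", "raised", "rates"]
for name, mask in [("bidirectional", bidirectional_mask(len(tokens))),
                   ("causal", causal_mask(len(tokens)))]:
    print(name)
    for tok, row in zip(tokens, mask):
        print(f"  {tok:>7}: {row}")
```

Under the causal mask, "bank" sees only "the" and itself. Under the bidirectional mask it also sees "raised" and "rates", which is the context that tells a financial bank from a river bank.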

BERT-base: 12 transformer layers, 12 attention heads, 768 hidden dimension, 110M parameters. BERT-large: 24 layers, 16 heads, 1024 dim, 340M parameters. Most production fine-tuning uses BERT-base or a distilled variant. BERT-large has about three times the parameters, with correspondingly higher memory use and inference latency.
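The parameter counts can be roughly reproduced from the layer sizes. The sketch below is back-of-the-envelope arithmetic, not an official accounting: it assumes a 30,522-token vocabulary (the size used by bert-base-uncased), 512 positions, 2 segment types, a feed-forward width of 4x the hidden size, and it counts the [CLS] pooler but not the pretraining heads. The function name is made up for this example.

```python
def bert_param_count(layers, hidden, vocab=30522, max_pos=512, segments=2):
    """Count encoder parameters, assuming FFN width = 4 * hidden."""
    ffn = 4 * hidden
    layer_norm = 2 * hidden  # scale + bias

    # Token, position, and segment embedding tables, plus one LayerNorm.
    embeddings = (vocab + max_pos + segments) * hidden + layer_norm

    # Q, K, V, and output projections (weights + biases), plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (hidden -> ffn -> hidden), plus LayerNorm.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    per_layer = attention + feed_forward

    pooler = hidden * hidden + hidden  # dense layer applied to [CLS]
    return embeddings + layers * per_layer + pooler


base = bert_param_count(layers=12, hidden=768)
large = bert_param_count(layers=24, hidden=1024)
print(f"BERT-base:  {base / 1e6:.1f}M")
print(f"BERT-large: {large / 1e6:.1f}M")
```

Under these assumptions the totals land near 109.5M and 335M, close to the commonly quoted 110M and 340M. The number of attention heads does not appear in the formula: heads split the hidden dimension, they do not add parameters.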

Pretraining Objective 1: Masked Language Modeling

During pretraining, 15% of input tokens are randomly selected. Of those: 80% are replaced with a [MASK] token, 10% are replaced with a random token, 10% are left unchanged. The model must predict the original token at every selected position, including the ones that were swapped for a random token or left unchanged. This forces deep bidirectional representations: to predict a masked word, you have to understand the full context around it.

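The 10% random + 10% unchanged trick exists because [MASK] tokens never appear at fine-tuning time. If every selected token became [MASK], the model would only need useful representations at [MASK] positions, creating a mismatch between pretraining and fine-tuning that degrades transfer. Because a selected token can also be random or unchanged, the model cannot tell which positions it will be asked to predict, so it has to keep a contextual representation of every input token.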
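The selection and 80/10/10 replacement described above can be sketched in a few lines. This is a simplified illustration, not BERT's actual data pipeline: it selects each token independently with probability 0.15 instead of picking exactly 15% of positions, works on string tokens, and uses a tiny made-up vocabulary. The name `mask_for_mlm` is hypothetical.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]", "[PAD]"}  # never selected for prediction


def mask_for_mlm(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted, labels); labels[i] is None where no loss is computed."""
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in SPECIAL or rng.random() >= select_prob:
            continue                          # not selected: nothing to predict here
        labels[i] = tok                       # the target is always the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the input token unchanged
    return corrupted, labels


rng = random.Random(0)
toy_vocab = ["the", "cat", "sat", "on", "mat", "dog"]
sentence = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
corrupted, labels = mask_for_mlm(sentence, toy_vocab, rng)
print(corrupted)
print(labels)
```

Note that `labels` is set for every selected position, whichever of the three branches fires. The loss is computed only where `labels[i]` is not `None`.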

Pretraining Objective 2: Next Sentence Prediction

BERT is also trained to predict whether sentence B actually follows sentence A. In 50% of training pairs, B is the true next sentence; in the other 50%, it is randomly sampled from the corpus. The input format is [CLS] Sentence A [SEP] Sentence B [SEP]. The [CLS] token's representation is used for the binary classification.
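A minimal sketch of how one NSP training example could be assembled, including the segment ids that tell the model which tokens belong to A and which to B. It simplifies the real procedure: negatives are drawn from the same list of sentences rather than from a different document, and `make_nsp_example` is a hypothetical helper that expects `index` to point at any sentence except the last.

```python
import random


def make_nsp_example(sentences, index, rng):
    """Build one NSP example from tokenized sentences given in document order."""
    sent_a = sentences[index]
    if rng.random() < 0.5:
        sent_b, is_next = sentences[index + 1], 1   # positive: the true next sentence
    else:
        # Negative: any sentence other than A itself or the true next one.
        candidates = [j for j in range(len(sentences)) if j not in (index, index + 1)]
        sent_b, is_next = sentences[rng.choice(candidates)], 0
    tokens = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
    # Segment 0 covers [CLS] + A + first [SEP]; segment 1 covers B + final [SEP].
    segment_ids = [0] * (len(sent_a) + 2) + [1] * (len(sent_b) + 1)
    return tokens, segment_ids, is_next


doc = [["it", "rained"], ["we", "stayed", "in"], ["stocks", "fell"], ["tea", "helps"]]
tokens, segment_ids, is_next = make_nsp_example(doc, 0, random.Random(0))
print(tokens)
print(segment_ids)
print("is_next =", is_next)
```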

NSP is widely considered a weak objective. RoBERTa (a BERT variant from Facebook AI) dropped NSP entirely and achieved better performance with MLM alone, trained longer on more data with larger batches and dynamic masking. A common explanation is that NSP is too easy: the model can exploit topic differences between random sentence pairs rather than learning true discourse relationships.

The [CLS] Token — And Why It Fails for Semantic Similarity

The [CLS] token was designed as an aggregate sequence representation for classification tasks. After fine-tuning on a classification task, the [CLS] vector is fed to a linear layer for the prediction. This works well for fine-tuned BERT. It does NOT work for sentence similarity out of the box.

The common mistake: take any two sentences, run each through BERT, extract the [CLS] vectors, and compute cosine similarity. The resulting scores track human similarity judgments poorly. Reimers and Gurevych (2019) found that [CLS] vectors from BERT without fine-tuning, and even averaged BERT token outputs, often do worse on semantic textual similarity benchmarks than averaged GloVe word embeddings. Two paraphrases can have low cosine similarity; two unrelated sentences can have high similarity.

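The [CLS] token is not a universal sentence embedding. It is a task-specific representation that emerges after fine-tuning. For semantic similarity, you need Sentence-BERT (SBERT): BERT fine-tuned in a siamese setup, where both sentences pass through the same encoder, the token outputs are pooled (mean pooling by default), and a sentence-pair objective (NLI classification, similarity regression, or triplet loss) trains the pooled vectors so that cosine similarity reflects meaning.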
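To pin down what "extract the [CLS] vector" and "pool the token outputs" mean mechanically, here is a sketch of the two pooling strategies and the cosine comparison, in plain Python. The tiny 2-dimensional vectors are made up and stand in for an encoder's last-layer outputs; nothing here loads a real model. Pooling choice alone does not rescue vanilla BERT; the point of SBERT is that the pooled vector is what gets trained.

```python
import math


def cls_pool(hidden_states):
    """Use the first token's vector ([CLS]) as the sentence vector."""
    return hidden_states[0]


def mean_pool(hidden_states, attention_mask):
    """Average token vectors, skipping padding positions (mask == 0)."""
    kept = [vec for vec, m in zip(hidden_states, attention_mask) if m]
    return [sum(col) / len(kept) for col in zip(*kept)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical outputs for one sentence: 3 real tokens + 1 padding position.
hidden = [[1.0, 0.0], [0.0, 2.0], [2.0, 2.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

print("CLS pooled: ", cls_pool(hidden))
print("mean pooled:", mean_pool(hidden, mask))
print("cosine:     ", cosine(cls_pool(hidden), mean_pool(hidden, mask)))
```

The attention mask matters for mean pooling: without it, the padding vector would be averaged into the sentence embedding.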

WordPiece Tokenization

BERT uses WordPiece tokenization, a subword algorithm that decomposes rare and out-of-vocabulary (OOV) words into known subword pieces. 'unbelievable' might become ['un', '##believ', '##able']; stripping the '##' markers and joining the pieces gives back the original word. The '##' prefix marks continuation pieces. This handles morphologically rich languages, rare words, and technical vocabulary without a massive vocabulary.
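At tokenization time, WordPiece splits each word greedily: take the longest vocabulary entry that matches at the current position, then continue from where it ended. The sketch below shows that matching step only (not how the vocabulary is learned), using a toy vocabulary invented for this example. Real splits depend entirely on the learned vocabulary.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1                          # shrink the candidate from the right
        if match is None:
            return [unk]                      # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"un", "##believ", "##able", "token", "##ization"}
print(wordpiece("unbelievable", toy_vocab))   # ['un', '##believ', '##able']
print(wordpiece("tokenization", toy_vocab))   # ['token', '##ization']
print(wordpiece("xyz", toy_vocab))            # ['[UNK]']
```

If any position has no matching piece, the whole word falls back to a single unknown token instead of a partial split.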

BERT's vocabulary has roughly 30,000 tokens (30,522 in bert-base-uncased). Every input starts with [CLS], and each segment ends with [SEP]. The maximum input length is 512 tokens, special tokens included. This is a hard architectural limit: the model learns position embeddings only for positions 0 to 511. Processing longer documents requires chunking, sliding windows, or switching to a long-context model.
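One way to handle documents beyond the 512-token limit is an overlapping sliding window over the token ids. The sketch below is one possible approach, not a library API: `sliding_windows` is a hypothetical helper, the stride of 128 is an arbitrary choice, and the ids 101 and 102 are the [CLS] and [SEP] ids in bert-base-uncased (treat them as an assumption for any other vocabulary).

```python
def sliding_windows(token_ids, cls_id, sep_id, max_len=512, stride=128):
    """Split a long token-id sequence into overlapping windows that fit max_len."""
    body = max_len - 2       # reserve room for [CLS] and [SEP]
    step = body - stride     # consecutive windows overlap by `stride` tokens
    if step <= 0:
        raise ValueError("stride must be smaller than max_len - 2")
    windows = []
    for start in range(0, len(token_ids), step):
        chunk = token_ids[start:start + body]
        windows.append([cls_id] + chunk + [sep_id])
        if start + body >= len(token_ids):
            break            # this window already reached the end of the document
    return windows


document_ids = list(range(1000, 2200))  # stand-in for 1,200 real token ids
windows = sliding_windows(document_ids, cls_id=101, sep_id=102)
print(len(windows), [len(w) for w in windows])
```

The overlap means a sentence cut at one window boundary appears whole in the neighboring window. Per-window outputs still have to be combined afterwards, for example by taking the maximum score or averaging embeddings.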

Fine-Tuning Patterns

Classification: Add a linear layer on top of [CLS]. Fine-tune the whole stack end-to-end. 2-5 epochs are usually sufficient.
Token classification (NER): Add a linear layer on top of each token representation. Predict a label per token.
Question answering: Concatenate question and context with [SEP]. Predict start/end token positions for the answer span.
Sentence similarity: Use SBERT (siamese BERT) — don't use vanilla [CLS] representations.
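Of the patterns above, question answering has the least obvious decoding step: the model emits one start score and one end score per token, and the answer is the span that maximizes their sum, subject to the start not coming after the end. A minimal sketch of that search follows, with made-up logits and a hypothetical `best_span` helper. A real pipeline would also restrict the search to context tokens rather than the question.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximizing start + end score with start <= end."""
    best = (0, 0, float("-inf"))
    for s, s_score in enumerate(start_logits):
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):              # only spans that end at or after s
            score = s_score + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best


qa_tokens = ["[CLS]", "who", "wrote", "it", "[SEP]",
             "jane", "austen", "wrote", "it", "[SEP]"]
start_logits = [0.1, 0.0, 0.0, 0.0, 0.0, 4.0, 1.0, 0.2, 0.1, 0.0]  # hypothetical
end_logits = [0.1, 0.0, 0.0, 0.0, 0.0, 1.5, 5.0, 0.3, 0.1, 0.0]    # hypothetical

s, e, score = best_span(start_logits, end_logits)
print(qa_tokens[s:e + 1], score)   # ['jane', 'austen'] 9.0
```

Taking the argmax of each logit vector independently can produce an end before the start; searching over valid pairs avoids that.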

BERT Variants That Matter in Production
