Vision Transformers (ViT): How Images Become Tokens
The architecture that displaced CNNs as the default for many vision tasks. How patch embeddings work, why transformers scale better than convolutions, and what the attention mechanism sees in an image.
For a decade, convolutional neural networks (CNNs) dominated computer vision. The inductive biases built into CNNs (local connectivity, translation equivariance, parameter sharing) made them a natural fit for images. Then, in 2020, Google Brain published 'An Image is Worth 16×16 Words' and showed that a pure transformer, with almost none of those inductive biases, could match or beat CNNs on image classification, provided it was trained on enough data.
The Vision Transformer (ViT) reuses the encoder of the original text transformer almost unchanged. The main difference is how the input is represented: text transformers tokenize text into word pieces, while ViT tokenizes an image into patches.
Patch Embeddings: Turning Images into Tokens
Given a 224×224 input image, ViT splits it into a grid of fixed-size patches, typically 16×16 pixels each. That gives (224/16)² = 196 patches. Each 16×16×3 patch (3 for the RGB channels) is flattened into a vector of 16·16·3 = 768 values, then linearly projected to the model width d_model. These patch embeddings are the tokens, and they are fed into a standard transformer encoder.
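To make the patch arithmetic concrete, here is a minimal NumPy sketch of the split-flatten-project step. The function name `patchify`, the random stand-in image, and the randomly initialized `W_proj` are illustrative assumptions, not code from the paper; in a trained model the projection weights are learned.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered row by row (raster order).
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    gh, gw = h // patch_size, w // patch_size
    # (gh, P, gw, P, C) -> (gh, gw, P, P, C) -> (gh * gw, P * P * C)
    x = image.reshape(gh, patch_size, gw, patch_size, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3), dtype=np.float32)  # stand-in for a real image
patches = patchify(image, patch_size=16)             # shape (196, 768)

# Assumed model width (768 is the ViT-Base hidden size). It matches the
# flattened patch length 16*16*3 here only by coincidence.
d_model = 768
W_proj = rng.standard_normal((patches.shape[1], d_model)).astype(np.float32) * 0.02
b_proj = np.zeros(d_model, dtype=np.float32)
tokens = patches @ W_proj + b_proj                   # shape (196, d_model)
```

The same operation is often written as a convolution with kernel size and stride both equal to the patch size, which is mathematically the same linear map applied to each patch.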
ViT treats each image patch the same way BERT treats each word token. The transformer has no built-in knowledge that patch 5 is spatially adjacent to patch 6 — that spatial information is injected via learnable positional embeddings added to each patch embedding.
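A small experiment shows why the positional embeddings matter. The sketch below uses a single attention head with random weights and a toy width of 64 (both assumptions for illustration): without positional embeddings, shuffling the patches merely shuffles the outputs, so the layer cannot tell one arrangement from another.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over an (N, D) sequence."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
n, d = 196, 64                      # 196 patches; d is a small illustrative width
x = rng.standard_normal((n, d))     # stand-in patch embeddings
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
perm = rng.permutation(n)           # a random reordering of the patches

# Without positional embeddings, shuffling the input just shuffles the output.
out = self_attention(x, Wq, Wk, Wv)
out_shuffled = self_attention(x[perm], Wq, Wk, Wv)
order_blind = np.allclose(out[perm], out_shuffled)

# With embeddings tied to the slot index, the arrangement changes the result.
pos = rng.standard_normal((n, d))   # stand-in for learned positional embeddings
out_pos = self_attention(x + pos, Wq, Wk, Wv)
out_pos_shuffled = self_attention(x[perm] + pos, Wq, Wk, Wv)
order_aware = not np.allclose(out_pos[perm], out_pos_shuffled)

print(order_blind, order_aware)     # True True
```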
The Architecture
- Patch embedding layer: linear projection of flattened patches to d_model dimensions.
- Class token [CLS]: a learnable vector prepended to the sequence, used as the classification embedding.
- Positional embeddings: learnable 1D embeddings added to each patch position (not 2D — this was a deliberate simplification that worked fine empirically).
- Transformer encoder: L layers of multi-head self-attention + MLP blocks, identical to the text transformer.
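- Classification head: a small head applied to the [CLS] token output to produce class logits (in the original ViT, an MLP with one hidden layer during pretraining and a single linear layer during fine-tuning).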
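The components above can be wired together in a few dozen lines. This is a toy NumPy sketch under stated assumptions, not a reference implementation: weights are random, biases and the learnable LayerNorm scale and shift are omitted, the sizes (width 64, 2 layers, 10 classes) are deliberately tiny, and a single linear layer stands in for the classification head. All function and parameter names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)  # learnable scale/shift omitted

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def init(*shape):
    return rng.standard_normal(shape) * 0.02

def make_block(d, heads, d_mlp):
    return dict(heads=heads, Wqkv=init(d, 3 * d), Wo=init(d, d),
                W1=init(d, d_mlp), W2=init(d_mlp, d))

def encoder_block(x, p):
    """Pre-norm block: x + MSA(LN(x)), then x + MLP(LN(x))."""
    n, d = x.shape
    h = p["heads"]
    qkv = layer_norm(x) @ p["Wqkv"]                        # (n, 3d)
    q, k, v = (t.reshape(n, h, d // h).transpose(1, 0, 2)  # each (h, n, d/h)
               for t in np.split(qkv, 3, axis=-1))
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d // h))  # (h, n, n)
    x = x + (attn @ v).transpose(1, 0, 2).reshape(n, d) @ p["Wo"]
    return x + gelu(layer_norm(x) @ p["W1"]) @ p["W2"]

def vit_forward(patches, params):
    tokens = patches @ params["W_patch"]                   # patch embedding
    x = np.concatenate([params["cls"], tokens], axis=0)    # prepend [CLS]
    x = x + params["pos"]                                  # 1D positional embeddings
    for block in params["blocks"]:
        x = encoder_block(x, block)
    cls_out = layer_norm(x)[0]                             # final norm, take [CLS]
    return cls_out @ params["W_head"]                      # class logits

d, n_patches, patch_dim, n_classes = 64, 196, 768, 10      # toy sizes
params = dict(
    W_patch=init(patch_dim, d),
    cls=init(1, d),
    pos=init(n_patches + 1, d),
    blocks=[make_block(d, heads=4, d_mlp=4 * d) for _ in range(2)],
    W_head=init(d, n_classes),
)
patches = rng.random((n_patches, patch_dim))               # stand-in flattened patches
logits = vit_forward(patches, params)                      # shape (10,)
```

Note that the sequence length inside the encoder is 197, not 196: the [CLS] token takes one extra slot and gets its own positional embedding.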
Why Transformers Scale Better Than CNNs
CNNs have inductive biases that help on small datasets but constrain what they can learn from large ones. Self-attention in ViT is fully global: every patch attends to every other patch regardless of spatial distance. ViT can therefore relate distant regions within a single layer (the sky and the ground being semantically related, for example), whereas a CNN needs many stacked layers before its receptive field spans them.
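A back-of-the-envelope comparison illustrates the difference in reach, and what global attention costs. The sketch below is deliberately idealized: it assumes stride-1 3×3 convolutions with no pooling or dilation (real CNNs downsample, so their receptive fields grow much faster), and the helper names are made up for illustration.

```python
def conv_receptive_field(num_layers, kernel_size=3):
    """Receptive field (in pixels per side) of stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)

def layers_to_span(image_size, kernel_size=3):
    """Stride-1 conv layers needed before one unit can see a full image side."""
    layers = 0
    while conv_receptive_field(layers, kernel_size) < image_size:
        layers += 1
    return layers

def attention_pairs(image_size, patch_size):
    """Patch-to-patch scores computed per head in one layer (ignoring [CLS])."""
    n = (image_size // patch_size) ** 2
    return n * n

print(layers_to_span(224))       # 112
print(attention_pairs(224, 16))  # 38416
```

A single attention layer connects all 196 patches at once, but it pays for that with a score matrix that grows quadratically in the number of patches.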
The cost is that ViT needs much more data, because it has to learn spatial structure from scratch: locality and translation equivariance are not built in. Trained on ImageNet alone (1.2M images), ViT underperforms a comparable CNN. Pretrained on JFT-300M (300M images), it outperforms one. The empirical rule: ViT wins when you have data at scale.
What Attention Sees in Images
Visualizing attention weights in ViT produces something genuinely useful: the model learns to attend to semantically meaningful regions. Even without any object detection supervision, attention maps often separate foreground from background. Locality also shifts with depth: in the early layers some heads attend to nearby patches while others already attend across the whole image, and in the deeper layers nearly all heads attend globally (cross-image semantic relationships).
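One way to put a number on "local" versus "global" is mean attention distance: the average pixel distance between a query patch and the patches it attends to, weighted by the attention weights. The sketch below assumes you have already extracted one head's attention matrix with the [CLS] row and column removed; the two synthetic heads are stand-ins for real attention maps, and the function name is illustrative.

```python
import numpy as np

def mean_attention_distance(attn, grid_size, patch_size):
    """Attention-weighted average pixel distance between query and key patches.

    attn: (N, N) row-stochastic attention weights for one head over the
          N = grid_size**2 patch tokens ([CLS] row and column removed).
    """
    rows, cols = np.divmod(np.arange(grid_size * grid_size), grid_size)
    centers = np.stack([rows, cols], axis=1) * patch_size  # (N, 2), in pixels
    diff = centers[:, None, :] - centers[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))               # (N, N)
    return float((attn * dist).sum(axis=-1).mean())

grid, patch = 14, 16
n = grid * grid

local_head = np.eye(n)                  # each patch attends only to itself
global_head = np.full((n, n), 1.0 / n)  # each patch attends uniformly everywhere

print(mean_attention_distance(local_head, grid, patch))   # 0.0
print(mean_attention_distance(global_head, grid, patch))  # much larger
```

Computing this per head and per layer on real images gives a picture of how far each head looks, which plays a role similar to receptive field size in a CNN.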
ViT Variants
| Model | Key Change | Use Case |
|---|---|---|
| DeiT (Meta) | Training efficiency — trains competitively on ImageNet alone via distillation | When you don't have JFT-scale data |
| Swin Transformer (Microsoft) | Hierarchical patches with local windows — merges patches as it deepens | Dense prediction tasks (detection, segmentation) |
| BEiT (Microsoft) | Masked image modeling pretraining (like BERT for images) | Self-supervised vision pretraining |
| MAE (Meta) | Masked autoencoder — reconstruct 75% masked patches | Scalable self-supervised ViT pretraining |
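
As an illustration of the MAE row, here is a minimal sketch of random patch masking at a 75% ratio. It is a simplified single-image version with a made-up function name, not MAE's actual code: the encoder would see only the `visible` tokens, and a decoder would be trained to reconstruct the patches where `mask` is True.

```python
import numpy as np

def random_masking(tokens, mask_ratio, rng):
    """Keep a random subset of patch tokens.

    Returns the kept tokens, a boolean mask (True where a patch was removed
    and must be reconstructed), and the indices of the kept patches.
    """
    n = tokens.shape[0]
    n_keep = int(round(n * (1.0 - mask_ratio)))
    shuffled = rng.permutation(n)
    keep_idx = np.sort(shuffled[:n_keep])  # sorted only for readability
    mask = np.ones(n, dtype=bool)
    mask[keep_idx] = False
    return tokens[keep_idx], mask, keep_idx

rng = np.random.default_rng(0)
tokens = rng.standard_normal((196, 768))   # stand-in patch embeddings
visible, mask, keep_idx = random_masking(tokens, mask_ratio=0.75, rng=rng)
print(visible.shape, int(mask.sum()))      # (49, 768) 147
```

Because only a quarter of the tokens pass through the encoder, pretraining is considerably cheaper per image than running the encoder on the full sequence.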
In many multimodal systems today, the vision encoder is a ViT variant pretrained with CLIP (a common choice is ViT-L/14, meaning 14×14-pixel patches, at 336px input). The output patch embeddings from this CLIP-ViT become the visual tokens fed to the LLM, typically after a projection into the LLM's embedding space.
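The patch size and input resolution determine how many visual tokens the LLM has to consume per image. A quick sketch of the arithmetic (the helper name is hypothetical, and any [CLS] token is not counted):

```python
def num_visual_tokens(image_size, patch_size):
    """Number of patch tokens for a square image, excluding any [CLS] token."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    return (image_size // patch_size) ** 2

print(num_visual_tokens(336, 14))  # 576
print(num_visual_tokens(224, 14))  # 256
print(num_visual_tokens(224, 16))  # 196
```

Going from 224px to 336px input at the same patch size more than doubles the visual token count, which is worth keeping in mind when budgeting context length.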