
Vision Transformers (ViT): How Images Become Tokens

The architecture that replaced CNNs for vision tasks. How patch embeddings work, why transformers scale better than convolutions, and what the attention mechanism sees in an image.

For a decade, convolutional neural networks (CNNs) dominated computer vision. The inductive biases built into CNNs (local connectivity, translation equivariance, parameter sharing) made them a natural fit for images. Then, in 2020, researchers at Google Brain published 'An Image is Worth 16×16 Words' and showed that a pure transformer, with almost none of those inductive biases, could match or beat CNNs on image classification, provided it was pretrained on enough data.

The Vision Transformer (ViT) is architecturally almost identical to the encoder of the original text transformer. The main difference is how the input is represented: text transformers tokenize words (or subwords), while ViT tokenizes image patches.

Patch Embeddings: Turning Images into Tokens

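Given a 224×224 input image, ViT splits it into a grid of fixed-size, non-overlapping patches, typically 16×16 pixels each. That gives (224/16)² = 14² = 196 patches. Each 16×16×3 patch (3 for the RGB channels) is flattened into a vector of 16·16·3 = 768 values and then passed through a learned linear projection to the model's embedding dimension (also 768 in ViT-Base, which is a property of that configuration rather than a requirement). These patch embeddings are the tokens. They're fed into a standard transformer encoder.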
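
The patch-splitting and projection steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: the function name `patchify`, the random stand-in image, and the randomly initialised `W_proj` and `b_proj` are assumptions made for the example (in a real model the projection is learned).

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3), dtype=np.float32)  # stand-in for a real RGB image

patches = patchify(image)  # shape (196, 768): 196 patches, 16 * 16 * 3 values each

embed_dim = 768  # assumed model width (ViT-Base uses 768)
# Random weights for illustration only; a trained ViT learns these.
W_proj = rng.normal(0, 0.02, size=(patches.shape[1], embed_dim)).astype(np.float32)
b_proj = np.zeros(embed_dim, dtype=np.float32)

tokens = patches @ W_proj + b_proj  # shape (196, 768): one token per patch
print(patches.shape, tokens.shape)
```

Patches come out in row-major order, so `patches[0]` is the top-left patch and `patches[1]` is the one immediately to its right.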

ViT treats each image patch the same way BERT treats each word token. The transformer has no built-in knowledge that patch 5 is spatially adjacent to patch 6 — that spatial information is injected via learnable positional embeddings added to each patch embedding.
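
A small experiment makes the point about missing spatial knowledge concrete. The sketch below uses a single-head self-attention layer with random weights (all names and sizes are assumptions chosen for illustration). Without positional embeddings, shuffling the patches simply shuffles the outputs in the same way, so the layer cannot tell which patches were neighbours. Once a positional embedding is added per slot, the order matters.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over rows of x."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
n, d = 196, 64  # 196 patch tokens, small width to keep the demo fast
x = rng.normal(size=(n, d))  # stand-in patch embeddings
Wq, Wk, Wv = (rng.normal(0, d ** -0.5, size=(d, d)) for _ in range(3))
pos_embed = rng.normal(0, 0.02, size=(n, d))  # learnable in a real model
perm = rng.permutation(n)  # a random reordering of the patches

# Without positions: permuting the input just permutes the output.
out = self_attention(x, Wq, Wk, Wv)
out_shuffled = self_attention(x[perm], Wq, Wk, Wv)
order_blind = np.allclose(out[perm], out_shuffled)

# With positions: slot i always receives pos_embed[i], so patch order changes the result.
out_pos = self_attention(x + pos_embed, Wq, Wk, Wv)
out_pos_shuffled = self_attention(x[perm] + pos_embed, Wq, Wk, Wv)
order_aware = not np.allclose(out_pos[perm], out_pos_shuffled)

print(order_blind, order_aware)
```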

Why Transformers Scale Better Than CNNs

CNNs have inductive biases that help on small datasets but can limit what they learn from large ones. Self-attention in ViT is global from the first layer: every patch attends to every other patch regardless of spatial distance. This lets ViT model long-range dependencies directly (for example, relating sky at the top of an image to ground at the bottom), whereas a CNN only reaches them after stacking enough layers to grow its receptive field.
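
As a rough back-of-the-envelope comparison, the sketch below counts how many stacked 3×3 convolutions it takes for one output position to see a full 224-pixel span, against the number of patch pairs a single attention layer connects. It deliberately assumes stride-1 convolutions with no pooling or dilation; real CNNs downsample and so reach a global view in far fewer layers. The function names are made up for this example.

```python
def conv_receptive_field(num_layers: int, kernel_size: int = 3) -> int:
    """Receptive field in pixels of stacked stride-1 convolutions (no pooling)."""
    return 1 + num_layers * (kernel_size - 1)

def layers_to_cover(image_size: int, kernel_size: int = 3) -> int:
    """Fewest stacked stride-1 conv layers whose receptive field is as wide as the image."""
    layers = 0
    while conv_receptive_field(layers, kernel_size) < image_size:
        layers += 1
    return layers

num_patches = (224 // 16) ** 2       # 196 patch tokens
attention_pairs = num_patches ** 2   # pairs connected by ONE attention layer

print(layers_to_cover(224))  # 112
print(attention_pairs)       # 38416
```

The flip side of global attention is the quadratic pair count: doubling the number of patches quadruples the size of the attention matrix.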

The cost is that ViT needs much more data to learn spatial structure from scratch, since locality and translation equivariance are not built in. Trained on ImageNet alone (about 1.28M images), ViT performs worse than a comparable CNN. Pretrained on JFT-300M (about 300M images), it performs better. The empirical rule: ViT wins when you have data at scale.

What Attention Sees in Images

Visualizing attention weights in ViT shows that the model learns to attend to semantically meaningful regions. Even without any object detection or segmentation supervision, attention maps often separate foreground from background. In early layers, many heads attend locally (to nearby patches), although some already attend across the whole image. In deep layers, attention is predominantly global (cross-image semantic relationships).
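
One common way to quantify "local" versus "global" attention is the attention-weighted mean distance between a query patch and the patches it attends to. The sketch below is an illustrative version of that measurement: the function name is an assumption, it expects attention weights over patch tokens only, and the two hand-built heads are synthetic extremes rather than weights from a trained model.

```python
import numpy as np

def mean_attention_distance(attn: np.ndarray, grid_size: int, patch_size: int = 16) -> float:
    """Average pixel distance attended to, for (n, n) row-normalised attention weights."""
    rows, cols = np.divmod(np.arange(grid_size ** 2), grid_size)
    coords = np.stack([rows, cols], axis=1) * patch_size  # (n, 2) patch origins in pixels
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)  # (n, n)
    return float((attn * dists).sum(axis=1).mean())

grid = 14            # 224 / 16 patches per side
n = grid * grid

local_head = np.eye(n)                   # each patch attends only to itself
global_head = np.full((n, n), 1.0 / n)   # each patch attends uniformly to every patch

local_dist = mean_attention_distance(local_head, grid)
global_dist = mean_attention_distance(global_head, grid)
print(local_dist, global_dist)
```

Applied to real attention maps, a small value for a head suggests convolution-like local behaviour, and a large value suggests the head is integrating information across the image.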

ViT Variants

| Model | Key Change | Use Case |
| --- | --- | --- |
| DeiT (Meta) | Training efficiency: trains competitively on ImageNet alone via distillation | When you don't have JFT-scale data |
| Swin Transformer (Microsoft) | Hierarchical patches with local windows: merges patches as it deepens | Dense prediction tasks (detection, segmentation) |
| BEiT (Microsoft) | Masked image modeling pretraining (like BERT for images) | Self-supervised vision pretraining |
| MAE (Meta) | Masked autoencoder: masks 75% of patches, then reconstructs them | Scalable self-supervised ViT pretraining |
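
To give a feel for the 75% masking listed for MAE, here is a minimal sketch of random patch masking. The helper `random_mask` and its fixed seed are assumptions for this example, not the authors' code. The idea is that the encoder only processes the visible quarter of the patches, and the remaining indices are the reconstruction targets.

```python
import numpy as np

def random_mask(num_patches: int, mask_ratio: float = 0.75, seed: int = 0):
    """Return (visible, masked) patch indices for a random mask."""
    rng = np.random.default_rng(seed)
    num_keep = int(num_patches * (1 - mask_ratio))
    shuffled = rng.permutation(num_patches)
    keep_idx = np.sort(shuffled[:num_keep])   # patches the encoder sees
    mask_idx = np.sort(shuffled[num_keep:])   # patches to be reconstructed
    return keep_idx, mask_idx

keep_idx, mask_idx = random_mask(196)
print(len(keep_idx), len(mask_idx))  # 49 147
```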

In many production multimodal systems today, the vision encoder is a ViT variant pretrained with CLIP (a common choice is ViT-L/14, meaning 14×14-pixel patches, at 336px input). The output patch embeddings from this CLIP-ViT become the visual tokens fed to the LLM.
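
The number of visual tokens follows directly from the input resolution and patch size. A quick sketch of the arithmetic (the helper name is made up for this example, and it counts patch tokens only):

```python
def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of patch tokens for a square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    return (image_size // patch_size) ** 2

print(num_patch_tokens(224, 16))  # 196
print(num_patch_tokens(336, 14))  # 576
```

So a 336px image through a /14 encoder yields roughly three times as many patch tokens as the 224px, 16×16 setup described earlier, which is worth keeping in mind when budgeting LLM context length.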
