
How Multimodal LLMs Work: LLaVA, GPT-4V, and Gemini Architecture

How you connect a vision encoder to a language model — the projection layer, visual tokens, and instruction tuning approach behind LLaVA. How GPT-4V and Gemini differ architecturally, and what 'natively multimodal' actually means.

A multimodal LLM is usually not a fundamentally new architecture. In the LLaVA-style design, it is a standard language model with an added visual perception module and a mechanism to inject image information into the token sequence. Understanding the three components (the vision encoder, the projection layer, and the LLM) explains much of why GPT-4V, LLaVA, and Gemini behave the way they do.

Component 1: The Vision Encoder

The vision encoder converts an image into a sequence of embeddings. In many open-weight multimodal LLMs, the vision encoder is a pretrained CLIP-style ViT. LLaVA-1.5 uses CLIP ViT-L/14 at 336px input resolution; other families (InternVL, Idefics, CogVLM) use their own ViT variants, but the pattern is the same. A 336×336 image split into patches of 14×14 pixels gives a 24×24 grid, or (336/14)² = 576 patch embeddings. These are the raw visual features.
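
The patch arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the helper name `num_patches` is made up for illustration and is not part of any library.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Count the non-overlapping square patches a ViT cuts from a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


print(num_patches(336, 14))  # 576 (a 24x24 grid)
print(num_patches(224, 14))  # 256 (a 16x16 grid)
```

The count grows with the square of the resolution, which is why higher-resolution inputs become expensive quickly.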

The encoder is typically kept frozen during multimodal training. The reasoning: CLIP already encodes rich visual semantics. Unfreezing it risks degrading those semantics for the sake of the language objective.

Component 2: The Projection Layer

The CLIP ViT outputs 576 embeddings of dimension 1024 (for ViT-L). The LLM expects token embeddings of a different dimension — say 4096 for a 7B model. The projection layer is what bridges them. Different models use different projections:

LLaVA-1: a single linear layer, cheap but lossy.
LLaVA-1.5: a two-layer MLP with GELU activation, better at preserving visual detail.
LLaVA-1.6 / LLaVA-NeXT: the same MLP projector, applied to a high-resolution image that is split into tiles and encoded tile by tile. This raises the visual token count and helps on document and OCR tasks.
Flamingo (DeepMind): a Perceiver Resampler plus gated cross-attention layers interleaved in the LLM itself, more expressive but more expensive.

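The projection layer is the one component trained from scratch. The vision encoder is usually frozen, and the LLM starts from pretrained weights and is either kept frozen or fine-tuned in a later stage. The projection layer has no pretrained starting point, so it has to learn the alignment between the two modalities from the multimodal data alone.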
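
To make the linear-versus-MLP distinction concrete, here is a minimal pure-Python sketch of a two-layer GELU projector in the style described for LLaVA-1.5. The toy dimensions (8 and 16), the initialization, and all function names are assumptions chosen for readability; a real implementation would use a tensor library. The parameter counts at the end use the 1024 and 4096 dimensions mentioned above.

```python
import math
import random


def gelu(x: float) -> float:
    """Exact GELU using the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def make_linear(d_in: int, d_out: int, rng: random.Random):
    """Return (weight, bias) with small random weights (illustrative init)."""
    scale = 1.0 / math.sqrt(d_in)
    weight = [[rng.uniform(-scale, scale) for _ in range(d_in)] for _ in range(d_out)]
    return weight, [0.0] * d_out


def linear(x, layer):
    """Apply y = W x + b to a single vector."""
    weight, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def mlp_projector(patches, fc1, fc2):
    """Project each patch embedding independently: Linear -> GELU -> Linear."""
    return [linear([gelu(h) for h in linear(p, fc1)], fc2) for p in patches]


def param_count(d_vision: int, d_llm: int, two_layer: bool) -> int:
    """Weights plus biases for a linear or two-layer MLP projector."""
    first = d_vision * d_llm + d_llm
    second = d_llm * d_llm + d_llm if two_layer else 0
    return first + second


rng = random.Random(0)
D_VISION, D_LLM, N_PATCHES = 8, 16, 4  # toy sizes, not the real 1024 / 4096 / 576
fc1 = make_linear(D_VISION, D_LLM, rng)
fc2 = make_linear(D_LLM, D_LLM, rng)
patches = [[rng.gauss(0.0, 1.0) for _ in range(D_VISION)] for _ in range(N_PATCHES)]

visual_tokens = mlp_projector(patches, fc1, fc2)
print(len(visual_tokens), len(visual_tokens[0]))  # 4 16

print(param_count(1024, 4096, two_layer=False))  # 4198400
print(param_count(1024, 4096, two_layer=True))   # 20979712
```

Even the two-layer version comes to about 21M parameters, a rounding error next to a 7B-parameter LLM, which is part of why training it alone is cheap.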

Component 3: The Language Model

The LLM receives a combined token sequence: projected image tokens + text tokens. From the LLM's perspective, it's just attending over a longer sequence. The image tokens are injected before the user's text prompt, so the model 'sees' the image before processing the question. Training teaches the model which image tokens to attend to when answering visual questions.
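
As a rough sketch of how the combined sequence is assembled, the snippet below splices projected image tokens into a prompt at a placeholder position. The sentinel value `IMAGE_PLACEHOLDER`, the toy embedding function, and the token ids are all hypothetical; real implementations use their own placeholder conventions.

```python
from typing import Callable, List, Sequence

Embedding = List[float]

IMAGE_PLACEHOLDER = -1  # hypothetical sentinel id marking where the image goes


def splice_image_tokens(
    token_ids: Sequence[int],
    image_tokens: Sequence[Embedding],
    embed_text: Callable[[int], Embedding],
) -> List[Embedding]:
    """Replace the placeholder with the projected image tokens, preserving order."""
    sequence: List[Embedding] = []
    for token_id in token_ids:
        if token_id == IMAGE_PLACEHOLDER:
            sequence.extend(image_tokens)
        else:
            sequence.append(embed_text(token_id))
    return sequence


D_LLM = 4  # toy embedding width


def toy_embed(token_id: int) -> Embedding:
    """Stand-in for the LLM's embedding table lookup."""
    return [float(token_id)] * D_LLM


image_tokens = [[0.0] * D_LLM for _ in range(576)]
prompt_ids = [1, IMAGE_PLACEHOLDER, 42, 43, 44]  # start token, image, three text tokens

inputs = splice_image_tokens(prompt_ids, image_tokens, toy_embed)
print(len(inputs))  # 580 = 4 text positions + 576 image positions
```

After this step the LLM sees one flat list of embeddings and has no special code path for the image positions.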

The Training Recipe (LLaVA approach)

Stage 1 (feature alignment): freeze both the vision encoder and the LLM. Train only the projection layer on a large image-caption dataset (roughly 600k pairs). Goal: teach the projection to translate visual features into tokens the LLM can interpret.
Stage 2 (instruction tuning): unfreeze the LLM (or LoRA fine-tune it) and keep training the projection layer. Train on visual instruction data (visual Q&A, image descriptions, reasoning tasks). Goal: teach the model to follow instructions about images.
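
The two stages can be summarized as a small freeze/unfreeze table. This is an illustrative sketch, not LLaVA's actual training configuration: the `Stage` class and field names are invented here, and keeping the projector trainable in stage 2 is an assumption based on the published LLaVA recipe.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    name: str
    train_vision_encoder: bool
    train_projector: bool
    train_llm: bool


STAGES = [
    # Stage 1: only the projector learns; everything else is frozen.
    Stage("feature_alignment", train_vision_encoder=False, train_projector=True, train_llm=False),
    # Stage 2: the LLM is unfrozen (or LoRA-tuned); the projector is assumed to keep training.
    Stage("instruction_tuning", train_vision_encoder=False, train_projector=True, train_llm=True),
]


def trainable_components(stage: Stage) -> List[str]:
    """List the components that receive gradient updates in a given stage."""
    flags = {
        "vision_encoder": stage.train_vision_encoder,
        "projector": stage.train_projector,
        "llm": stage.train_llm,
    }
    return [name for name, enabled in flags.items() if enabled]


for stage in STAGES:
    print(stage.name, trainable_components(stage))
```

In a real training loop these flags would translate into enabling or disabling gradients on each module's parameters.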

How GPT-4V and Gemini Differ

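OpenAI has not disclosed GPT-4V's architecture in detail. What is documented is the API's image handling: in high-detail mode the image is scaled to fit within 2048×2048, resized so its shortest side is 768px, and split into 512×512 tiles that are each billed separately on top of a low-resolution overview. That points to a tiling strategy similar in spirit to LLaVA-NeXT's, although how the tiles are encoded and combined internally is not public. It is also widely assumed, though not confirmed, that GPT-4V was trained on far more multimodal data than any open model.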

Gemini was designed as natively multimodal from pretraining, meaning the model was pretrained jointly on text, images, audio, and video rather than having vision bolted on post-hoc. Google's claim is that joint pretraining from scratch produces deeper visual-language alignment than the two-stage approach. In practice, Gemini tends to be stronger at tasks that require interleaved text-image reasoning across a long context.

When choosing a multimodal model for production: open-weight models (LLaVA-1.6, InternVL2, Phi-3-Vision) are competitive with GPT-4V on structured tasks like document extraction and chart reading. GPT-4V has an edge on open-ended visual reasoning and handling unusual image types. Gemini 1.5 Pro has an edge on long documents with interleaved images.

Visual Token Count and Cost

LLaVA-1.5 uses 576 visual tokens per image at 336px. GPT-4V's tiling is billed at 85 base tokens plus 170 tokens per 512px tile, so a 1024×1024 image in high-detail mode costs 765 tokens. Every visual token occupies a position in the context and is processed by every transformer layer, so multimodal calls are significantly more expensive than text-only calls. At an input price of $0.01 per 1K tokens, a single high-resolution image costs roughly $0.01 in tokens alone, before the response.
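
For budgeting, the per-image token count can be estimated up front. The sketch below follows the counting rule OpenAI published for GPT-4 Turbo with vision (85 base tokens plus 170 per 512px tile after resizing). The function names are made up, the no-upscaling behavior for small images is an assumption, and the default price of $0.01 per 1K input tokens is a placeholder to replace with current pricing.

```python
import math


def gpt4v_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate billed input tokens for one image under the published tiling rule."""
    if detail == "low":
        return 85  # low-detail mode is a flat cost regardless of size

    # Step 1: shrink (never enlarge) so the image fits inside 2048x2048.
    scale = min(1.0, 2048 / max(width, height))
    width, height = round(width * scale), round(height * scale)

    # Step 2: shrink so the shortest side is at most 768px.
    # Assumption: images already smaller than this are not upscaled.
    scale = min(1.0, 768 / min(width, height))
    width, height = round(width * scale), round(height * scale)

    # Step 3: count 512px tiles and add the base cost.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles


def image_cost_usd(tokens: int, usd_per_1k_tokens: float = 0.01) -> float:
    """Convert a token count to dollars; the default price is a placeholder."""
    return tokens * usd_per_1k_tokens / 1000


print(gpt4v_image_tokens(1024, 1024))         # 765
print(gpt4v_image_tokens(2048, 4096))         # 1105
print(gpt4v_image_tokens(4096, 8192, "low"))  # 85
print(image_cost_usd(gpt4v_image_tokens(2048, 4096)))
```

Running this over a sample of production images before launch gives a more reliable cost estimate than a single per-image average.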

Related post: CLIP: How Contrastive Vision-Language Pretraining Works
