Multimodal RAG: Retrieval Over Images, Tables, and Mixed Documents

RAG breaks when your corpus has PDFs with charts, scanned documents, or product images. This article covers three architectures for mixed-modal retrieval (extract-then-embed, separate indexes with cross-modal reranking, and end-to-end multimodal embeddings) and when each fails.

Standard RAG assumes your corpus is text. Chunk it, embed it, retrieve it. This works until your documents have tables, charts, diagrams, scanned PDFs, or product images — which is most real enterprise corpora. The moment you have non-text content, you need a strategy for how to represent, index, and retrieve it.

Multimodal RAG isn't one technique — it's a spectrum of approaches with very different tradeoffs. The right choice depends on what your corpus looks like, what your queries look like, and how much you care about accuracy vs. cost vs. latency.

The Core Problem

A PDF with a revenue chart is not retrievable by a text query about 'Q3 revenue growth' unless you either: (a) extract the chart's data as text, (b) embed the chart image alongside its context, or (c) use a model that can read the chart directly. Each approach has failure modes.

Architecture 1: Extract-Then-Embed

Use a document parsing pipeline (Amazon Textract, Google Document AI, Unstructured.io, or a multimodal LLM) to convert all non-text content into text before indexing. Charts become table summaries. Images get captions. Diagrams get descriptions. Then proceed with standard text RAG.
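The conversion step can be sketched as follows. This is a minimal illustration, not any parser's real output format: the `Element` schema, the `to_text_chunks` helper, and the `fake_captioner` stand-in are all assumptions introduced here. In practice the elements would come from a parsing pipeline and the descriptions from a captioning model or multimodal LLM.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Element:
    """One parsed document element (hypothetical schema for illustration)."""
    kind: str      # "text", "table", or "image"
    content: str   # raw text, serialized table rows, or an image path
    page: int


def to_text_chunks(
    elements: List[Element],
    describe_image: Callable[[str], str],
) -> List[dict]:
    """Turn every element into a text chunk that a standard text index can hold."""
    chunks = []
    for el in elements:
        if el.kind == "text":
            text = el.content
        elif el.kind == "table":
            # Keep the table as serialized text; a real pipeline might add a summary.
            text = "Table: " + el.content
        elif el.kind == "image":
            # describe_image stands in for a captioning model or multimodal LLM.
            text = "Figure: " + describe_image(el.content)
        else:
            continue  # element kinds this sketch does not handle are skipped
        # Record the original modality so answers can be traced back to it.
        chunks.append({"text": text, "page": el.page, "source_kind": el.kind})
    return chunks


def fake_captioner(path: str) -> str:
    """Placeholder so the sketch runs without any model."""
    return f"description of {path}"


elements = [
    Element("text", "Revenue grew in Q3.", 1),
    Element("table", "quarter,revenue\nQ2,10\nQ3,12", 1),
    Element("image", "page1_chart.png", 1),
]
chunks = to_text_chunks(elements, fake_captioner)
```

Keeping `source_kind` and `page` on each chunk is a design choice in this sketch: when a generated description turns out to be wrong, you can tell which retrieved chunks came from a model-written caption rather than from the document's own text.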

Architecture 2: Separate Indexes, Re-rank Together

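Maintain separate indexes: one for text chunks (text embeddings) and one for images (CLIP embeddings). At query time, run the query against both indexes in parallel, embedding it once with each index's encoder (the CLIP text encoder for the image index). Similarity scores from the two embedding spaces are not directly comparable, so use a cross-modal reranker to score all retrieved candidates together and pick the best.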
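A rough sketch of the query-time flow, assuming toy in-memory indexes and a reranker passed in as a plain function. The names `top_k`, `retrieve`, and `rerank_score` are hypothetical; a real system would use a vector database and a cross-modal reranking model in their place.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Vector, index: Dict[str, Vector], k: int) -> List[str]:
    """Brute-force nearest neighbours over a small in-memory index."""
    ranked = sorted(index, key=lambda doc_id: cosine(query_vec, index[doc_id]), reverse=True)
    return ranked[:k]


def retrieve(
    text_query_vec: Vector,
    image_query_vec: Vector,
    text_index: Dict[str, Vector],
    image_index: Dict[str, Vector],
    rerank_score: Callable[[str], float],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Search both indexes, then order the pooled candidates by reranker score."""
    # Each index is searched with a query vector from its own embedding model.
    candidates = top_k(text_query_vec, text_index, k) + top_k(image_query_vec, image_index, k)
    # Only the reranker's scores are compared across modalities.
    scored = [(doc_id, rerank_score(doc_id)) for doc_id in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data: the two indexes deliberately have different dimensions.
text_index = {"t1": [1.0, 0.0], "t2": [0.0, 1.0]}
image_index = {"i1": [0.9, 0.1, 0.0], "i2": [0.0, 0.0, 1.0]}
toy_rerank = {"t1": 0.4, "t2": 0.1, "i1": 0.9, "i2": 0.2}

results = retrieve([1.0, 0.0], [1.0, 0.0, 0.0], text_index, image_index, toy_rerank.get, k=1)
```

In this toy run the image candidate `i1` outranks the text candidate `t1` purely because the stand-in reranker scores it higher; the per-index cosine scores are used only to pick candidates within each index.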

Architecture 3: Multimodal Embeddings End-to-End

Use a model that embeds text, images, and mixed content in a single shared vector space. ColPali (2024) does this for PDF pages: it embeds each page as an image using a vision-language model, producing a multi-vector representation with one vector per image patch. Queries are embedded into the same space as a set of token vectors, and pages are ranked by late-interaction scoring between query vectors and page vectors. Retrieval works directly over page images, with no text extraction needed.

ColPali's insight: instead of extracting text from a PDF page, embed the page as an image and retrieve at the page level. This eliminates the extraction bottleneck entirely and handles layout-dependent content (tables, charts, multi-column text) that extraction pipelines mangle.
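Multi-vector retrieval of this kind is typically scored with a late-interaction (MaxSim) operator in the style of ColBERT: each query vector is matched to its best page vector, and the per-vector maxima are summed. The sketch below shows only that scoring step on toy vectors. The function names and the data are illustrative assumptions; real embeddings would come from the model itself.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], page_vecs: List[Vector]) -> float:
    """Sum, over query vectors, of the best dot product against any page vector."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def rank_pages(query_vecs: List[Vector], pages: Dict[str, List[Vector]]) -> List[str]:
    """Order page ids by descending MaxSim score."""
    return sorted(pages, key=lambda pid: maxsim_score(query_vecs, pages[pid]), reverse=True)


# Toy stand-ins: two query token vectors, two pages with two patch vectors each.
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_1": [[1.0, 0.0], [0.5, 0.5]],
    "page_2": [[0.2, 0.1], [0.1, 0.2]],
}
ranking = rank_pages(query_vecs, pages)
```

Because every page stores many vectors rather than one, index size and scoring cost grow with the number of patches per page, which is worth measuring before committing a large corpus to this approach.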

The Generation Step

Whatever retrieval architecture you use, if you retrieve images, your generation model must be multimodal. You can't feed a retrieved image to a text LLM. This means your answer generation step uses GPT-4V, Gemini, or an open-weight multimodal LLM — adding cost and latency compared to text-only RAG.
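One way to limit that extra cost, sketched here as an assumption rather than a recommendation from the text above, is to route each request by what was actually retrieved: use the multimodal model only when the context contains an image. The `Retrieved` type, `choose_generator`, and the model names are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Retrieved:
    kind: str      # "text" or "image"
    payload: str   # chunk text or image path


def choose_generator(items: List[Retrieved], text_model: str, multimodal_model: str) -> str:
    """Pick the multimodal model only when at least one retrieved item is an image."""
    needs_vision = any(item.kind == "image" for item in items)
    return multimodal_model if needs_vision else text_model


# Placeholder model names, not real product identifiers.
text_only = [Retrieved("text", "Revenue grew in Q3.")]
mixed = [Retrieved("text", "See the chart."), Retrieved("image", "page1_chart.png")]

model_for_text = choose_generator(text_only, "text-llm", "multimodal-llm")
model_for_mixed = choose_generator(mixed, "text-llm", "multimodal-llm")
```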

Failure Modes Specific to Multimodal RAG

Never trust a multimodal RAG system on charts without a hallucination audit. Run a sample of visual Q&A tasks with ground truth numbers and measure exact-match accuracy on extracted values. LLMs are significantly less accurate at reading chart values than at reading text.
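A minimal sketch of such an audit, assuming each sample pairs a model's free-text answer with a ground-truth number. The helpers `extract_number` and `exact_match_accuracy`, the number-parsing rule, and the sample data are all illustrative assumptions; real answers may need unit and scale normalization ("12.4 million" vs. "12,400,000") that this sketch does not attempt.

```python
import re
from typing import List, Optional, Tuple

# Matches a number that is not glued to a preceding letter or digit,
# so the "3" in "Q3" is skipped while "12.4" and "1,250" are found.
NUMBER = re.compile(r"(?<![\w.])-?\d[\d,]*(?:\.\d+)?")


def extract_number(answer: str) -> Optional[float]:
    """Return the first standalone number in the answer, or None if there is none."""
    match = NUMBER.search(answer)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


def exact_match_accuracy(samples: List[Tuple[str, float]], tolerance: float = 0.0) -> float:
    """Fraction of answers whose extracted value is within `tolerance` of the truth."""
    if not samples:
        return 0.0
    hits = 0
    for answer, truth in samples:
        value = extract_number(answer)
        if value is not None and abs(value - truth) <= tolerance:
            hits += 1
    return hits / len(samples)


# Made-up audit samples: (model answer, ground-truth value read from the chart).
samples = [
    ("Q3 revenue was 12.4 million", 12.4),
    ("About 1,250 units", 1250.0),
    ("Roughly 48%", 47.0),
    ("The chart does not say", 5.0),
]
strict = exact_match_accuracy(samples)
lenient = exact_match_accuracy(samples, tolerance=1.0)
```

Reporting both a strict score and one with a small tolerance helps separate answers that misread the chart entirely from answers that are off by a rounding step.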
