Multimodal RAG: Retrieval Over Images, Tables, and Mixed Documents
RAG breaks when your corpus has PDFs with charts, scanned documents, or product images. This article covers three architectures for mixed-modal retrieval (extract-then-embed, separate indexes with joint reranking, and end-to-end multimodal embeddings) and when each one fails.
Standard RAG assumes your corpus is text. Chunk it, embed it, retrieve it. This works until your documents have tables, charts, diagrams, scanned PDFs, or product images — which is most real enterprise corpora. The moment you have non-text content, you need a strategy for how to represent, index, and retrieve it.
Multimodal RAG isn't one technique — it's a spectrum of approaches with very different tradeoffs. The right choice depends on what your corpus looks like, what your queries look like, and how much you care about accuracy vs. cost vs. latency.
The Core Problem
A PDF with a revenue chart is not retrievable by a text query about 'Q3 revenue growth' unless you either: (a) extract the chart's data as text, (b) embed the chart image alongside its context, or (c) use a model that can read the chart directly. Each approach has failure modes.
Architecture 1: Extract-Then-Embed (Late Fusion)
Use a document parsing pipeline (Amazon Textract, Google Document AI, Unstructured.io, or a multimodal LLM) to convert all non-text content into text before indexing. Charts become table summaries. Images get captions. Diagrams get descriptions. Then proceed with standard text RAG.
- Pros: works with any text embedding model, no multimodal retrieval required, cheap at query time.
- Cons: extraction quality is the ceiling. Bad OCR or a misread chart is baked into the index, and nothing downstream can recover it. Diagrams that depend on spatial relationships lose that structure when converted to text.
- When to use: when your content is primarily text-heavy PDFs with occasional tables/charts, and you have budget for a good parsing pipeline.
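The indexing side of this architecture can be sketched roughly as follows. This is a minimal sketch, assuming a parser has already split each document into typed elements. The `Element` shape, `table_to_text`, `elements_to_chunks`, and the `caption_image` callable are hypothetical names used for illustration, not the API of Textract, Document AI, or Unstructured.io.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Element:
    """One parsed unit of a document (hypothetical shape, not a real parser's output)."""
    kind: str                                   # "text", "table", or "image"
    page: int
    text: str = ""                              # used when kind == "text"
    rows: List[List[str]] = field(default_factory=list)  # used when kind == "table"
    image_bytes: Optional[bytes] = None         # used when kind == "image"


def table_to_text(rows: List[List[str]]) -> str:
    """Serialize a table row by row, repeating the header so each row stands alone."""
    if not rows:
        return ""
    header, *body = rows
    lines = []
    for row in body:
        cells = [f"{h}: {v}" for h, v in zip(header, row)]
        lines.append("; ".join(cells))
    return "\n".join(lines)


def elements_to_chunks(
    elements: List[Element],
    caption_image: Callable[[Optional[bytes]], str],
) -> List[Dict]:
    """Turn every element into a text chunk that a text embedding model can index."""
    chunks = []
    for el in elements:
        if el.kind == "text":
            body = el.text
        elif el.kind == "table":
            body = table_to_text(el.rows)
        elif el.kind == "image":
            # Assumption: the caller supplies a captioner (for example a multimodal LLM).
            body = caption_image(el.image_bytes)
        else:
            continue  # unknown element types are skipped in this sketch
        if body.strip():
            chunks.append({"text": body, "page": el.page, "source_kind": el.kind})
    return chunks
```

Keeping `source_kind` and `page` on each chunk makes it possible to trace a bad answer back to the extraction step that produced it.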
Architecture 2: Separate Indexes, Re-rank Together (Early Fusion)
Maintain separate indexes: one for text chunks (text embeddings), one for images (CLIP embeddings). At query time, run the query against both indexes in parallel. Use a cross-modal reranker to score all retrieved candidates together and pick the best.
- Pros: preserves image information in its native form. CLIP retrieval is strong for semantically matching images to text queries.
- Cons: cross-modal reranking is expensive. Retrieved images must be shown to a multimodal LLM for the answer generation step, adding latency and cost.
- When to use: when images are primary content (product catalogs, medical imaging, technical diagrams) and visual content must be retrieved correctly.
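The query-time flow for this architecture might look like the sketch below. It assumes toy in-memory indexes (dicts of id to vector) and a caller-supplied `rerank_score` callable standing in for the cross-modal reranker. All function names here are hypothetical, and the two indexes are queried one after the other rather than in parallel to keep the example short.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query_vec: Vector, index: Dict[str, Vector], k: int) -> List[str]:
    """Return the ids of the k vectors most similar to the query."""
    ranked = sorted(index, key=lambda doc_id: cosine(query_vec, index[doc_id]), reverse=True)
    return ranked[:k]


def retrieve_and_rerank(
    query: str,
    text_query_vec: Vector,    # query embedded with the text embedding model
    image_query_vec: Vector,   # query embedded with the image-space model (e.g. CLIP's text encoder)
    text_index: Dict[str, Vector],
    image_index: Dict[str, Vector],
    rerank_score: Callable[[str, str, str], float],  # (query, modality, doc_id) -> score
    k_per_index: int = 3,
    k_final: int = 3,
) -> List[Tuple[str, str]]:
    """Pull candidates from both indexes, then order them with one shared reranker."""
    candidates = [("text", i) for i in top_k(text_query_vec, text_index, k_per_index)]
    candidates += [("image", i) for i in top_k(image_query_vec, image_index, k_per_index)]
    # Similarities from the two indexes live in different embedding spaces and are
    # not comparable, so only the reranker's scores decide the final order.
    candidates.sort(key=lambda c: rerank_score(query, c[0], c[1]), reverse=True)
    return candidates[:k_final]
```

The design choice worth noting is that the first-stage similarity scores are discarded after candidate selection. Mixing raw cosine scores from two different models would bias the result toward whichever model happens to produce larger values.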
Architecture 3: Multimodal Embeddings End-to-End
Use a model that embeds text, images, and mixed content into a single shared space. ColPali (introduced in 2024) does this for PDF pages: it embeds each page as an image using a vision-language model, producing a multi-vector representation per page. Queries are embedded into the same space, so retrieval works directly over page images with no text extraction.
ColPali's insight: instead of extracting text from a PDF page, embed the page as an image and retrieve at the page level. This eliminates the extraction bottleneck entirely and handles layout-dependent content (tables, charts, multi-column text) that extraction pipelines mangle.
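Multi-vector page representations are typically scored with ColBERT-style late interaction: each query vector is matched to its best page vector, and those maxima are summed. The sketch below shows that scoring rule in plain Python on toy vectors. `maxsim_score` and `rank_pages` are hypothetical helper names, and a real system would compute this over model-produced embeddings with a tensor library rather than lists.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], page_vecs: List[Vector]) -> float:
    """Sum, over query vectors, of the best dot product against any page vector."""
    if not page_vecs:
        return 0.0
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def rank_pages(query_vecs: List[Vector], pages: Dict[str, List[Vector]]) -> List[str]:
    """Order page ids from best to worst late-interaction score."""
    return sorted(pages, key=lambda pid: maxsim_score(query_vecs, pages[pid]), reverse=True)
```

Because every page stores many vectors instead of one, index size and scoring cost grow with the number of vectors per page, which is worth measuring before committing to this approach.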
The Generation Step
Whatever retrieval architecture you use, if you retrieve images, your generation model must be multimodal. You can't feed a retrieved image to a text LLM. This means your answer generation step uses GPT-4V, Gemini, or an open-weight multimodal LLM — adding cost and latency compared to text-only RAG.
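One way to make that constraint explicit in code is to decide the generator type from what was retrieved. The sketch below is illustrative only: the request dictionary, the `model_kind` field, and the item keys (`kind`, `data`, `context`) are hypothetical and do not follow any specific provider's message format.

```python
import base64
from typing import Dict, List


def build_generation_request(question: str, retrieved: List[Dict]) -> Dict:
    """Assemble prompt parts and flag whether a multimodal model is required."""
    parts = [{"type": "text", "text": question}]
    has_image = False
    for item in retrieved:
        if item["kind"] == "image":
            has_image = True
            parts.append({
                "type": "image",
                # Images are base64-encoded so the payload stays JSON-serializable.
                "base64": base64.b64encode(item["data"]).decode("ascii"),
                # Surrounding context (what the chart measures, date range, source), if stored.
                "context": item.get("context", ""),
            })
        else:
            parts.append({"type": "text", "text": item["data"]})
    # A text-only model is enough only when no image was retrieved.
    return {"model_kind": "multimodal" if has_image else "text", "parts": parts}
```

Routing text-only retrievals to a cheaper text model is one way to limit the extra cost to the queries that actually need image understanding.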
Failure Modes Specific to Multimodal RAG
- Chart hallucination: the LLM generates plausible-sounding numbers from a chart image that it cannot actually read accurately. Worse on charts with small text, complex scales, or overlapping bars.
- Image-text mismatch: a chart is retrieved but its surrounding context (what it measures, the date range, the source) is not. The LLM generates an answer without knowing what the chart represents.
- Compounding OCR errors: bad extraction at indexing time produces bad embeddings, which produce bad retrieval, which produces bad answers. The failure stays invisible until you inspect the retrieved chunk.
- Resolution degradation: images stored at low resolution in PDFs become unreadable even to vision models. Chart data is unrecoverable.
Never trust a multimodal RAG system on charts without a hallucination audit. Run a sample of visual Q&A tasks with ground-truth numbers and measure exact-match accuracy on the extracted values. LLMs are significantly less accurate at reading chart values than at reading text.
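A minimal version of such an audit could look like this. It assumes you already have model answers paired with ground-truth values, and that each answer states the value as its first number. `parse_number` and `exact_match_accuracy` are hypothetical helpers, and the optional `rel_tol` parameter is an addition for reporting near misses alongside strict exact match.

```python
import re
from typing import List, Optional, Tuple

_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def parse_number(answer: str) -> Optional[float]:
    """Return the first number in the answer, ignoring thousands separators."""
    match = _NUMBER.search(answer.replace(",", ""))
    return float(match.group()) if match else None


def exact_match_accuracy(pairs: List[Tuple[str, float]], rel_tol: float = 0.0) -> float:
    """Fraction of answers whose extracted value matches the ground truth.

    With rel_tol=0.0 this is strict exact match; a small rel_tol counts near misses.
    """
    if not pairs:
        return 0.0
    hits = 0
    for answer, truth in pairs:
        value = parse_number(answer)
        if value is None:
            continue  # an answer with no number counts as a miss
        if abs(value - truth) <= rel_tol * abs(truth):
            hits += 1
    return hits / len(pairs)
```

Reporting both the strict and the tolerant score helps separate answers that are slightly misread from answers that are invented outright.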