Multimodal RAG in Production: Images, Tables, and PDFs at Scale

This article covers RAG for documents with embedded images, charts, and tables: ColPali's OCR-free visual document retrieval, late-interaction retrieval for PDFs, handling image-dense corpora with vision encoders, and production architectures for enterprise document Q&A.

The Problem with Standard RAG on Real Documents

Standard RAG extracts text, chunks it, embeds it, and retrieves by cosine similarity. This works for pure-text documents. It breaks down on typical enterprise documents: PDFs with charts that encode critical information, tables where structure matters, scanned documents where OCR loses formatting, and slides where layout carries the meaning.
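To make the baseline concrete, here is a minimal sketch of the chunk, embed, and retrieve loop. The `embed` function is a toy bag-of-words stand-in for a real embedding model, and the chunk size and all function names are assumptions chosen for illustration, not part of any particular library.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk(text: str, size: int = 12) -> list[str]:
    """Split text into fixed-size word windows (no overlap, for brevity)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]
```

Even in this toy form the weakness is visible: a chunk that contains only the extracted caption "Figure 3" shares no terms with the query "what was Q3 revenue?", so its similarity is zero however relevant the underlying chart is. A real embedding model is less brittle than term counts, but it still cannot recover information that never reached the text.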

Approach 1: Extract-Then-Embed

The traditional approach: use OCR to extract text from PDFs, use vision models to caption images, extract tables into markdown or CSV, then embed all of it. Tools like Unstructured.io, LlamaParse, and Azure Document Intelligence handle the extraction layer.

Failure mode: a chart showing the quarterly revenue trend comes out of extraction as a caption ('Figure 3') and perhaps alt text ('revenue chart'). The values plotted in the chart never reach the text index, so a query like 'what was Q3 revenue?' has nothing relevant to match and returns the wrong chunk 60%+ of the time.

Approach 2: ColPali — Visual Document Retrieval

ColPali (2024) takes a fundamentally different approach: render each PDF page as an image, encode it with a vision-language model (PaliGemma), and apply late-interaction retrieval (similar to ColBERT) directly to the resulting patch embeddings. No OCR. No text extraction. The model retrieves the right *page* by matching the text query's token embeddings against each page's patch embeddings.

Retrieves correctly on chart-heavy, table-heavy, and scanned documents
Late interaction: query tokens interact with page patch tokens at retrieval time
Performance: state of the art on the ViDoRe benchmark at release, which includes DocVQA-derived retrieval tasks
Cost: each page becomes ~1000 patch embeddings, so index size scales with total page count, not just document count
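The late-interaction scoring and the storage cost described above can be sketched in a few lines. This is plain Python for readability; a real system would use batched tensor operations. The scoring rule is the ColBERT-style MaxSim: each query token takes its best-matching patch, and the per-token maxima are summed. In `index_size_bytes`, the 128-dimensional vectors and 2-byte (float16) storage are assumptions made for the arithmetic; only the ~1000 vectors per page comes from the text above.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_tokens: Sequence[Vector], page_patches: Sequence[Vector]) -> float:
    """Late interaction: sum over query tokens of the max similarity to any patch."""
    return sum(max(dot(q, p) for p in page_patches) for q in query_tokens)


def rank_pages(query_tokens: Sequence[Vector], pages: Sequence[Sequence[Vector]]) -> list[int]:
    """Return page indices ordered from best to worst MaxSim score."""
    scores = [maxsim_score(query_tokens, patches) for patches in pages]
    return sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)


def index_size_bytes(
    n_pages: int,
    vectors_per_page: int = 1000,  # from the text: ~1000 patch embeddings per page
    dim: int = 128,                # assumption: embedding dimension
    bytes_per_value: int = 2,      # assumption: float16 storage, no compression
) -> int:
    """Rough uncompressed index size for a multi-vector page index."""
    return n_pages * vectors_per_page * dim * bytes_per_value
```

Under these assumptions a single page costs about 256 KB and a million pages about 256 GB before any quantization or pooling, which is why the multi-vector index, rather than the encoder, tends to dominate capacity planning.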

Production Architecture for Multimodal RAG

In production, you typically combine approaches: use ColPali for initial page retrieval (finds the right page even in image-heavy docs), then pass the retrieved page image directly to a vision-capable LLM (GPT-4V, Claude 3, Gemini) for final answer generation. This 'retrieve page → generate from image' pattern handles 90%+ of real enterprise document Q&A cases.
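A minimal sketch of the 'retrieve page → generate from image' flow follows. The retriever and the vision LLM are passed in as plain callables so the orchestration stays independent of any particular library. `PageHit`, `answer_from_pages`, and the `min_score` cutoff are hypothetical names introduced for illustration; in practice the retriever might wrap a ColPali index and the generator a vision-capable LLM client.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class PageHit:
    """One retrieved page (hypothetical shape, for illustration)."""
    doc_id: str
    page_number: int
    score: float
    image_png: bytes  # rendered page image handed to the vision LLM


# Assumed interfaces: (query, k) -> hits, and (question, page images) -> answer text.
Retriever = Callable[[str, int], Sequence[PageHit]]
Generator = Callable[[str, Sequence[bytes]], str]


def answer_from_pages(
    question: str,
    retrieve_pages: Retriever,
    generate: Generator,
    k: int = 3,
    min_score: float = 0.0,
) -> dict:
    """Stage 1: retrieve candidate pages. Stage 2: answer from the page images."""
    hits = [h for h in retrieve_pages(question, k) if h.score >= min_score]
    if not hits:
        # No page cleared the cutoff: skip the LLM call instead of guessing.
        return {"answer": None, "sources": []}
    hits = sorted(hits, key=lambda h: h.score, reverse=True)
    answer: Optional[str] = generate(question, [h.image_png for h in hits])
    return {"answer": answer, "sources": [(h.doc_id, h.page_number) for h in hits]}
```

Returning the `(doc_id, page_number)` pairs alongside the answer keeps every response traceable to a page a reviewer can open, which matters for enterprise Q&A where answers need to be checked against the source.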

Document Type	Recommended Approach	Tool
Text-only PDFs	Text extraction + text embedding	Unstructured.io + OpenAI embeddings
Mixed text/image PDFs	ColPali retrieval + vision LLM generation	byaldi + GPT-4V/Claude
Scanned documents	ColPali (no OCR dependency)	byaldi
Structured tables	Extract to markdown + text RAG	LlamaParse + GPT-4
Slide decks	Page-as-image retrieval	ColPali or GPT-4V on slide images
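The table above can be read as a routing rule applied at ingestion time. The sketch below encodes it as a small function. The `DocProfile` fields and both thresholds are hypothetical: they stand in for whatever signals your ingestion pipeline can compute cheaply, and the cutoffs would need tuning on your own corpus.

```python
from dataclasses import dataclass
from enum import Enum


class Approach(Enum):
    TEXT_RAG = "text extraction + text embedding"
    TABLE_TO_MARKDOWN = "extract tables to markdown + text RAG"
    VISUAL_RETRIEVAL = "page-image retrieval + vision LLM generation"


@dataclass(frozen=True)
class DocProfile:
    """Cheap per-document signals (hypothetical fields, for illustration)."""
    has_text_layer: bool      # False for scanned documents
    image_area_ratio: float   # fraction of page area covered by images or charts
    table_page_ratio: float   # fraction of pages dominated by tables
    is_slide_deck: bool = False


def choose_approach(
    profile: DocProfile,
    image_threshold: float = 0.2,  # assumption: tune per corpus
    table_threshold: float = 0.5,  # assumption: tune per corpus
) -> Approach:
    """Map a document profile to one row of the table above."""
    if profile.is_slide_deck or not profile.has_text_layer:
        return Approach.VISUAL_RETRIEVAL   # slide decks and scanned documents
    if profile.image_area_ratio >= image_threshold:
        return Approach.VISUAL_RETRIEVAL   # mixed text/image PDFs
    if profile.table_page_ratio >= table_threshold:
        return Approach.TABLE_TO_MARKDOWN  # structured tables
    return Approach.TEXT_RAG               # text-only PDFs
```

Routing per document keeps the expensive multi-vector index limited to the documents that need it, while text-only PDFs stay on the cheaper text pipeline.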
