
Designing a Production RAG System: Full Architecture Walkthrough

Document ingestion pipeline, retrieval layer, reranker, answer policy, eval loop, and monitoring — all the decisions you need to make before you ship.

Building a RAG demo takes a weekend. Building a RAG system that works in production — one that handles messy documents, ambiguous queries, evolving knowledge bases, and real users who break things — takes months of iteration. This is the full architecture walkthrough: every component, every decision, every failure mode.

The hardest part of RAG is not the vector search. It's everything around it: document pipelines, index freshness, eval, observability. The retrieval itself is almost the easy part.

System overview

A production RAG system has four major subsystems: the data pipeline (ingestion and indexing), the retrieval pipeline (query-time), the generation pipeline (LLM call), and the operations layer (observability, evals, freshness). Most tutorials only show the middle two.

Subsystem	Components	Failure cost
Data pipeline	Crawlers, parsers, chunkers, embedders, indexers	Silent — stale docs cause confident wrong answers
Retrieval pipeline	Query rewriter, vector search, reranker, context builder	Visible — user gets wrong or irrelevant answer
Generation pipeline	Prompt builder, LLM call, response formatter, citation extractor	Visible — model ignores context, hallucinates
Operations layer	Tracing, evals, freshness monitors, cost dashboards	Invisible until something breaks badly

The data pipeline — where most teams underinvest

Document ingestion

Documents arrive in every format: PDFs with tables, HTML pages with nav menus, Word docs with headers, Confluence pages with embedded images. Each format needs a dedicated parser. A bad parser corrupts the chunk before it ever reaches the model.

PDFs: use PyMuPDF or pdfplumber rather than PyPDF2, which handles tables poorly and is no longer developed under that name (the project continues as pypdf). For scanned PDFs, you need OCR (Tesseract, AWS Textract, or Google Document AI).
HTML: strip navigation, footers, ads before chunking. Beautiful Soup works; Trafilatura extracts article content better.
Word/Excel: use python-docx and openpyxl. Preserve table structure — tables flattened to prose lose most of their meaning.
Images and diagrams: either skip them (document clearly), extract captions, or use a vision model to generate descriptions.
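
To make the HTML step above concrete, here is a minimal sketch that uses only the Python standard library. The tag set in SKIP_TAGS and the function name extract_main_text are assumptions made for this example; a dedicated extractor such as Trafilatura will usually cope better with real-world pages.

```python
from html.parser import HTMLParser

# Tags whose entire content is dropped before chunking.
# This set is an assumption for the example; tune it per site.
SKIP_TAGS = {"nav", "footer", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collects visible text while ignoring boilerplate containers."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0          # > 0 while inside a skipped tag
        self._parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._parts.append(text)

    def text(self) -> str:
        return "\n".join(self._parts)


def extract_main_text(html: str) -> str:
    """Return page text with navigation, footers and scripts removed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Ad containers rarely have a dedicated tag, so in practice they tend to need class-based or id-based rules on top of this.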

Chunking strategy selection

Strategy	Best for	Weakness
Fixed-size + overlap (512 tokens, 10%)	Quick start, homogeneous documents	Splits mid-sentence, mid-concept
Sentence-aware	Prose-heavy content (articles, manuals)	Sentences vary wildly in informativeness
Recursive character splitting	Mixed content — tries paragraph → sentence → word boundaries	Still arbitrary at boundaries
Semantic chunking	Best recall, topic-coherent chunks	Slow; needs embedding model at index time
Hierarchical (parent/child)	Long documents with clear sections	More complex index; two retrieval sizes
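
As a reference point for the first row of the table, here is a minimal sketch of fixed-size chunking with overlap. It works on a pre-tokenized list; the function name chunk_fixed is made up for this example, and in the usage note a whitespace split stands in for a real tokenizer.

```python
def chunk_fixed(tokens: list[str], size: int = 512,
                overlap_ratio: float = 0.10) -> list[list[str]]:
    """Split tokens into windows of `size` that overlap by `overlap_ratio`."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    overlap = int(size * overlap_ratio)
    step = size - overlap            # always >= 1 given the checks above

    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break                    # this window already reached the end
    return chunks
```

For a quick experiment, chunk_fixed(text.split()) is enough; the weakness listed in the table is visible immediately, because window boundaries ignore sentence and section breaks.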

Add a metadata envelope to every chunk: document ID, source URL, last-modified date, section title, chunk position (index/total). Without metadata, retrieval is blindfolded: you can't filter by date, source, or section.
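
One possible shape for that envelope is sketched below. The class name ChunkEnvelope, its field names, and the helper functions are assumptions for illustration; most vector databases store the same information as a payload or metadata dictionary.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ChunkEnvelope:
    document_id: str
    source_url: str
    last_modified: datetime
    section_title: str
    chunk_index: int      # position of this chunk within the document
    chunk_total: int      # number of chunks the document produced
    text: str


def wrap_chunks(texts: list[str], *, document_id: str, source_url: str,
                last_modified: datetime, section_title: str) -> list[ChunkEnvelope]:
    """Attach the same document-level metadata to every chunk."""
    total = len(texts)
    return [
        ChunkEnvelope(document_id, source_url, last_modified,
                      section_title, index, total, text)
        for index, text in enumerate(texts)
    ]


def filter_modified_since(chunks: list[ChunkEnvelope],
                          cutoff: datetime) -> list[ChunkEnvelope]:
    """Example metadata filter: keep chunks whose source changed on or after cutoff."""
    return [c for c in chunks if c.last_modified >= cutoff]
```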

Embedding model choice

Use the MTEB leaderboard as a starting point, but benchmark on your own domain. Common production choices in 2025: OpenAI text-embedding-3-large (3072 dims, a strong general-purpose default), Cohere embed-english-v3.0 (1024 dims; note its 512-token input limit when sizing chunks), nomic-embed-text (768 dims, runs locally, surprisingly competitive). For multilingual: multilingual-e5-large or Cohere embed-multilingual-v3.0.
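
Benchmarking on your domain can start very small. The sketch below computes recall@k for any embedding function over a handful of labelled queries. The names recall_at_k and embed are assumptions for this example; embed is whatever wrapper you write around a candidate model, and a brute-force cosine scan replaces the vector database.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall_at_k(embed: Callable[[str], Vector],
                docs: dict[str, str],
                labelled_queries: list[tuple[str, str]],
                k: int = 5) -> float:
    """Fraction of queries whose relevant doc ID appears in the top k results."""
    if not labelled_queries:
        return 0.0
    doc_vecs = {doc_id: embed(text) for doc_id, text in docs.items()}
    hits = 0
    for query, relevant_id in labelled_queries:
        query_vec = embed(query)
        ranked = sorted(doc_vecs,
                        key=lambda d: cosine(query_vec, doc_vecs[d]),
                        reverse=True)
        if relevant_id in ranked[:k]:
            hits += 1
    return hits / len(labelled_queries)
```

Running the same labelled queries through each candidate model gives a like-for-like comparison that a public leaderboard cannot.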

Index freshness — the silent killer

Knowledge goes stale. A policy updated six months ago is still in your index as the canonical version unless you actively expire it. Every production RAG system needs an index freshness strategy:

Track last-modified date for every source document in chunk metadata
Run a freshness monitor daily: flag chunks whose source has changed since indexing
Set TTLs on chunks from volatile sources (news, pricing pages, live docs)
When a document is updated, delete all chunks with that document ID before re-indexing
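
The four rules above can be sketched as a daily check over chunk metadata. Everything here is illustrative: IndexedChunk, find_stale_chunks and replace_document are hypothetical names, and a plain dictionary stands in for the vector index.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class IndexedChunk:
    chunk_id: str
    document_id: str
    indexed_at: datetime
    ttl: Optional[timedelta] = None   # set only for volatile sources


def find_stale_chunks(chunks: list[IndexedChunk],
                      source_modified: dict[str, datetime],
                      now: datetime) -> list[str]:
    """Return IDs of chunks whose source changed after indexing or whose TTL expired."""
    stale = []
    for chunk in chunks:
        modified = source_modified.get(chunk.document_id)
        changed = modified is not None and modified > chunk.indexed_at
        expired = chunk.ttl is not None and now - chunk.indexed_at > chunk.ttl
        if changed or expired:
            stale.append(chunk.chunk_id)
    return stale


def replace_document(index: dict[str, IndexedChunk], document_id: str,
                     new_chunks: list[IndexedChunk]) -> None:
    """Delete every chunk of a document before inserting its re-indexed chunks."""
    old_ids = [cid for cid, c in index.items() if c.document_id == document_id]
    for cid in old_ids:
        del index[cid]
    for chunk in new_chunks:
        index[chunk.chunk_id] = chunk
```

Deleting by document ID first matters because a re-chunked document rarely produces the same number of chunks, so upserting by chunk ID alone can leave orphans from the old version behind.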

The most common production RAG incident: a user asks about your pricing and gets last year's rates. The model answers confidently with stale data because the old document has higher embedding similarity than the new one (which uses slightly different wording). Index freshness is not optional for anything time-sensitive.

The retrieval pipeline — advanced patterns

Query transformation

Raw user queries are often poor retrieval inputs. Users phrase things colloquially, with typos, with missing context. Three transformations that consistently improve retrieval:

Multi-query expansion: generate 3 paraphrases of the query, retrieve for all, deduplicate. Adds one LLM call and can lift recall by roughly 15–30%, depending on the corpus; measure the gain on your own eval set.
HyDE (Hypothetical Document Embeddings): generate a hypothetical answer to the query, embed that instead of the query. The hypothetical answer's vocabulary matches document vocabulary better than a short question.
Step-back prompting: rewrite a specific query to its more general form before retrieval. 'What's the max refund for order #A1234?' → 'What is the refund policy?' — much better retrieval.
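
Multi-query expansion is the simplest of the three to wire in. In the sketch below, paraphrase and retrieve are injected callables and both are assumptions: paraphrase wraps your LLM call, and retrieve wraps your vector search and returns (chunk ID, score) pairs. Keeping the original query alongside the paraphrases is a design choice of this example.

```python
from typing import Callable


def multi_query_retrieve(
    query: str,
    paraphrase: Callable[[str, int], list[str]],
    retrieve: Callable[[str, int], list[tuple[str, float]]],
    n_paraphrases: int = 3,
    top_k: int = 20,
) -> list[str]:
    """Retrieve for the query and its paraphrases, then deduplicate by chunk ID."""
    queries = [query] + paraphrase(query, n_paraphrases)

    best_score: dict[str, float] = {}
    for q in queries:
        for chunk_id, score in retrieve(q, top_k):
            # Keep the highest score seen for each chunk across all queries.
            if chunk_id not in best_score or score > best_score[chunk_id]:
                best_score[chunk_id] = score

    ranked = sorted(best_score, key=best_score.get, reverse=True)
    return ranked[:top_k]
```

HyDE and step-back prompting fit the same shape: replace paraphrase with a function that returns a hypothetical answer or a generalized question, and leave the retrieval loop unchanged.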

Reranking

First-stage retrieval (vector search) optimises for speed, not precision. A cross-encoder reranker scores each query-chunk pair together, which is much more accurate but costs one forward pass per candidate (O(n) in the number of candidates). The production pattern: retrieve the top 20 cheaply, rerank them accurately to the top 5, then send the best 3 to 5 to the model depending on your token budget. Cohere Rerank, BGE-reranker-large, and Jina Reranker are the common choices.

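import os

import cohere

# The API key is read from the CO_API_KEY environment variable.
co = cohere.Client(api_key=os.environ["CO_API_KEY"])

def rerank(query: str, chunks: list[str], top_n: int = 5) -> list[str]:
    """Return the top_n chunks, ordered by reranker relevance."""
    results = co.rerank(
        query=query,
        documents=chunks,
        top_n=top_n,
        model="rerank-english-v3.0",
    )
    # Each result carries the index of its chunk in the input list.
    return [chunks[r.index] for r in results.results]

# Usage: first retrieve 20, then rerank to 5.
# vector_search is your own first-stage retriever (not defined here).
initial_chunks = vector_search(query, top_k=20)
precise_chunks = rerank(query, initial_chunks, top_n=5)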

Context assembly

The context you inject into the LLM prompt is not just a paste of the top-K chunks. For production quality: place the most relevant chunks at the start and end of the context and the weakest in the middle (lost-in-the-middle mitigation), add source attribution metadata, compress chunks that exceed a token budget using LLMLingua or selective sentence removal, and always include a 'no relevant information found' fallback instruction.
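
A minimal sketch of that assembly step follows. It makes two simplifying assumptions that are not the approach described above: a whitespace split approximates token counts, and chunks that do not fit the budget are dropped rather than compressed. The names order_for_long_context, build_context and FALLBACK are made up for this example.

```python
FALLBACK = ("If none of the sources below are relevant to the question, "
            "reply: 'No relevant information found.'")


def order_for_long_context(ranked: list) -> list:
    """Put the best items at both ends and the weakest in the middle.

    Input is sorted best-first. Items at even positions fill the front,
    items at odd positions fill the back in reverse order.
    """
    front, back = [], []
    for position, item in enumerate(ranked):
        (front if position % 2 == 0 else back).append(item)
    return front + back[::-1]


def build_context(ranked: list[tuple[str, str]], token_budget: int = 1500) -> str:
    """ranked holds (source, text) pairs sorted best-first."""
    kept, used = [], 0
    for source, text in ranked:
        cost = len(text.split())          # crude token estimate (assumption)
        if used + cost > token_budget:
            break                         # drop this and all weaker chunks
        kept.append((source, text))
        used += cost

    ordered = order_for_long_context(kept)
    blocks = [f"[Source {n}] ({source})\n{text}"
              for n, (source, text) in enumerate(ordered, start=1)]
    return "\n\n".join([FALLBACK] + blocks)
```

With five ranked chunks the final order is 1, 3, 5, 4, 2, so the two strongest chunks sit at the edges of the context.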

The generation pipeline

The LLM is the last component, but prompt design here matters more than model choice for most applications. Three things that dramatically improve generation quality:

Citation instruction: 'When answering, cite the source by referring to its position: [Source 1], [Source 2]'. Forces the model to ground claims.
Uncertainty instruction: 'If the provided context does not contain enough information to answer the question, say so explicitly. Do not extrapolate.' Reduces hallucinations on missing-context queries.
Conflict instruction: 'If sources disagree, present both perspectives and note the disagreement.' Prevents silent resolution of conflicts.
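
The three instructions can live in one prompt template, and the citation format makes a cheap post-check possible. This is a sketch under assumptions: the prompt layout and the names build_prompt and cited_sources are illustrative, not a required format.

```python
import re

RULES = (
    "When answering, cite the source by referring to its position: "
    "[Source 1], [Source 2].",
    "If the provided context does not contain enough information to answer "
    "the question, say so explicitly. Do not extrapolate.",
    "If sources disagree, present both perspectives and note the disagreement.",
)


def build_prompt(question: str, sources: list[str]) -> str:
    """Combine the rules, numbered sources and the question into one prompt."""
    rules = "\n".join(f"- {rule}" for rule in RULES)
    context = "\n\n".join(f"[Source {n}]\n{text}"
                          for n, text in enumerate(sources, start=1))
    return f"Rules:\n{rules}\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"


def cited_sources(answer: str, n_sources: int) -> tuple[set[int], set[int]]:
    """Return (valid, invalid) source numbers cited in the answer."""
    cited = {int(num) for num in re.findall(r"\[Source (\d+)\]", answer)}
    valid = {num for num in cited if 1 <= num <= n_sources}
    return valid, cited - valid
```

An answer that cites a source number that was never supplied, or cites nothing at all, is a useful signal to log and sample for review.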

The operations layer

Observability — what to instrument

Signal	Why it matters	How to capture
Retrieval precision	Are retrieved chunks relevant?	Log chunk scores + user feedback correlation
Context utilisation	Does the model actually use the context?	NLI model: does answer follow from context?
Answer faithfulness	Are answers grounded in retrieved docs?	RAGAS faithfulness on sampled queries
Latency breakdown	Where is time spent?	Trace spans: embed + search + rerank + LLM
Index freshness	How stale is the knowledge base?	Track source modified dates vs. chunk indexed dates
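
For the latency breakdown row, a small timing helper is enough to get started before adopting a full tracing stack. The Trace class below is a hypothetical stand-in for that stack; the span names follow the table.

```python
import time
from contextlib import contextmanager


class Trace:
    """Accumulates wall-clock time per named pipeline stage."""

    def __init__(self) -> None:
        self.spans: dict[str, float] = {}

    @contextmanager
    def span(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the stage raised an exception.
            elapsed = time.perf_counter() - start
            self.spans[name] = self.spans.get(name, 0.0) + elapsed

    def breakdown(self) -> dict[str, float]:
        """Share of total traced time spent in each stage."""
        total = sum(self.spans.values())
        return {name: (secs / total if total else 0.0)
                for name, secs in self.spans.items()}
```

Wrapping each stage as `with trace.span("rerank"): ...` and logging trace.spans per request shows where a latency target is being missed.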

Evaluation pipeline

A RAG eval pipeline has three test sets: a golden set (hand-annotated query/answer pairs), a regression set (past failures — every incident adds an example), and a synthetic set (LLM-generated Q&A over your documents, cheap to create at scale). Run all three on every significant change to chunking, embedding model, retrieval config, or prompt.
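
A minimal harness for running the three sets might look like the sketch below. The grading rule is deliberately crude and is an assumption of this example: a case passes when the expected string appears in the answer. A real pipeline would swap in an LLM judge or a metric such as RAGAS faithfulness. The names run_suites and passes_gate are hypothetical.

```python
from typing import Callable

# (query, substring expected in the answer)
EvalCase = tuple[str, str]


def run_suites(answer_fn: Callable[[str], str],
               suites: dict[str, list[EvalCase]]) -> dict[str, float]:
    """Return the pass rate per test set (0.0 for an empty set)."""
    report = {}
    for name, cases in suites.items():
        passed = sum(
            1 for query, expected in cases
            if expected.lower() in answer_fn(query).lower()
        )
        report[name] = passed / len(cases) if cases else 0.0
    return report


def passes_gate(report: dict[str, float], thresholds: dict[str, float]) -> bool:
    """True when every thresholded set meets its minimum pass rate."""
    return all(report.get(name, 0.0) >= minimum
               for name, minimum in thresholds.items())
```

A common arrangement is to require 100% on the regression set while allowing a lower bar on the synthetic set, since past incidents should never reappear.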

Architecture sizing guide

Scale	Vector DB	Embedding	Latency target	Approximate monthly infra cost
<100K chunks	Chroma (local) or Pinecone free	Any	No constraint	<$50
100K–5M chunks	Pinecone, Weaviate, Qdrant	text-embedding-3-small	<2s P95	$200–800
5M–100M chunks	Weaviate / Qdrant (self-hosted)	text-embedding-3-large	<1s P95	$1K–5K
>100M chunks	Distributed Qdrant / pgvector on Postgres	Custom fine-tuned	<500ms P95	Custom
