Designing a Production RAG System: Full Architecture Walkthrough
Document ingestion pipeline, retrieval layer, reranker, answer policy, eval loop, and monitoring — all the decisions you need to make before you ship.
Building a RAG demo takes a weekend. Building a RAG system that works in production — one that handles messy documents, ambiguous queries, evolving knowledge bases, and real users who break things — takes months of iteration. This is the full architecture walkthrough: every component, every decision, every failure mode.
The hardest part of RAG is not the vector search. It's everything around it: document pipelines, index freshness, eval, observability. The retrieval itself is almost the easy part.
System overview
A production RAG system has four major subsystems: the data pipeline (ingestion and indexing), the retrieval pipeline (query-time), the generation pipeline (LLM call), and the operations layer (observability, evals, freshness). Most tutorials only show the middle two.
| Subsystem | Components | Failure cost |
|---|---|---|
| Data pipeline | Crawlers, parsers, chunkers, embedders, indexers | Silent — stale docs cause confident wrong answers |
| Retrieval pipeline | Query rewriter, vector search, reranker, context builder | Visible — user gets wrong or irrelevant answer |
| Generation pipeline | Prompt builder, LLM call, response formatter, citation extractor | Visible — model ignores context, hallucinates |
| Operations layer | Tracing, evals, freshness monitors, cost dashboards | Invisible until something breaks badly |
The data pipeline — where most teams underinvest
Document ingestion
Documents arrive in every format: PDFs with tables, HTML pages with nav menus, Word docs with headers, Confluence pages with embedded images. Each format needs a dedicated parser. A bad parser corrupts the chunk before it ever reaches the model.
- PDFs: use PyMuPDF or pdfplumber rather than PyPDF2 (poor table handling). Scanned PDFs need OCR (Tesseract, AWS Textract, or Google Document AI).
- HTML: strip navigation, footers, ads before chunking. Beautiful Soup works; Trafilatura extracts article content better.
- Word/Excel: use python-docx and openpyxl. Preserve table structure — tables flattened to prose lose most of their meaning.
- Images and diagrams: either skip them (document clearly), extract captions, or use a vision model to generate descriptions.
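One way to keep "a dedicated parser per format" manageable is a small registry keyed by file extension. The sketch below is illustrative only: `register`, `parse_document`, and `UnsupportedFormat` are hypothetical names, and the plain-text parser is a stand-in for where a PyMuPDF, Trafilatura, or python-docx wrapper would plug in.

```python
from pathlib import Path
from typing import Callable

# Registry mapping a lowercase file extension to a parser function.
PARSERS: dict[str, Callable[[bytes], str]] = {}


class UnsupportedFormat(ValueError):
    """Raised when no parser is registered for a file type."""


def register(*extensions: str):
    """Decorator that registers one parser for one or more extensions."""
    def wrap(fn: Callable[[bytes], str]) -> Callable[[bytes], str]:
        for ext in extensions:
            PARSERS[ext.lower()] = fn
        return fn
    return wrap


@register(".txt", ".md")
def parse_plain_text(raw: bytes) -> str:
    # Stand-in parser; PDF, HTML, and DOCX parsers would register the same way.
    return raw.decode("utf-8", errors="replace").strip()


def parse_document(filename: str, raw: bytes) -> str:
    ext = Path(filename).suffix.lower()
    parser = PARSERS.get(ext)
    if parser is None:
        # Fail loudly rather than pushing badly parsed text into the index.
        raise UnsupportedFormat(f"no parser registered for {ext!r}")
    return parser(raw)
```

Raising on unknown formats is a deliberate choice here: a skipped document shows up in logs, whereas a corrupted chunk does not.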
Chunking strategy selection
| Strategy | Best for | Weakness |
|---|---|---|
| Fixed-size + overlap (512 tokens, 10%) | Quick start, homogeneous documents | Splits mid-sentence, mid-concept |
| Sentence-aware | Prose-heavy content (articles, manuals) | Sentences vary wildly in informativeness |
| Recursive character splitting | Mixed content — tries paragraph → sentence → word boundaries | Still arbitrary at boundaries |
| Semantic chunking | Best recall, topic-coherent chunks | Slow; needs embedding model at index time |
| Hierarchical (parent/child) | Long documents with clear sections | More complex index; two retrieval sizes |
Add a metadata envelope to every chunk: document ID, source URL, last-modified date, section title, chunk position (index/total). Retrieval quality without metadata is retrieval blindfolded — you can't filter by date, source, or section.
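As a minimal sketch of the first table row combined with the metadata envelope, the function below does fixed-size chunking with overlap and attaches the fields listed above to every chunk. The `Chunk` class and `chunk_fixed` function are hypothetical names, and whitespace-separated words stand in for tokens; a real pipeline would count tokens with the embedding model's tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    doc_id: str
    source_url: str
    last_modified: str  # ISO 8601 date of the source document
    section_title: str
    index: int          # position of this chunk within the document
    total: int          # number of chunks produced from the document


def chunk_fixed(text, doc_id, source_url, last_modified,
                section_title="", size=512, overlap_ratio=0.10):
    # Assumption: words approximate tokens closely enough for a sketch.
    words = text.split()
    if not words:
        return []
    overlap = int(size * overlap_ratio)
    step = max(1, size - overlap)
    windows = []
    for start in range(0, len(words), step):
        windows.append(words[start:start + size])
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return [
        Chunk(" ".join(w), doc_id, source_url, last_modified,
              section_title, i, len(windows))
        for i, w in enumerate(windows)
    ]
```

Storing `index` and `total` makes it possible to fetch neighbouring chunks at query time when a hit lands on a boundary.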
Embedding model choice
Use the MTEB leaderboard as a starting point, but benchmark on your own domain. Production choices in 2025: OpenAI text-embedding-3-large (3072 dims, best general-purpose), Cohere embed-english-v3.0 (1024 dims, strong on long documents), nomic-embed-text (768 dims, runs locally, surprisingly competitive). For multilingual: multilingual-e5-large or Cohere embed-multilingual-v3.0.
Index freshness — the silent killer
Knowledge goes stale. A policy updated six months ago is still in your index as the canonical version unless you actively expire it. Every production RAG system needs an index freshness strategy:
- Track last-modified date for every source document in chunk metadata
- Run a freshness monitor daily: flag chunks whose source has changed since indexing
- Set TTLs on chunks from volatile sources (news, pricing pages, live docs)
- When a document is updated, delete all chunks with that document ID before re-indexing
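The checklist above can be sketched as two small functions: one that flags documents needing re-indexing and one that replaces a document's chunks atomically. This is a sketch under stated assumptions: chunks are plain dicts with hypothetical `doc_id`, `source`, and `indexed_at` keys, and the index is an in-memory list rather than a real vector store.

```python
from datetime import datetime, timedelta, timezone


def stale_doc_ids(chunks, source_modified, ttl_by_source=None, now=None):
    """Return the doc IDs whose chunks should be re-indexed.

    chunks: dicts with 'doc_id', 'source', 'indexed_at' (aware datetime)
    source_modified: doc_id -> datetime the source document last changed
    ttl_by_source: source name -> timedelta, set only for volatile sources
    """
    now = now or datetime.now(timezone.utc)
    ttl_by_source = ttl_by_source or {}
    stale = set()
    for c in chunks:
        changed = source_modified.get(c["doc_id"])
        if changed is not None and changed > c["indexed_at"]:
            stale.add(c["doc_id"])  # source changed after we indexed it
        ttl = ttl_by_source.get(c["source"])
        if ttl is not None and now - c["indexed_at"] > ttl:
            stale.add(c["doc_id"])  # chunk outlived its TTL
    return stale


def reindex(index, doc_id, new_chunks):
    """Delete every chunk for doc_id before adding the fresh ones."""
    index[:] = [c for c in index if c["doc_id"] != doc_id]
    index.extend(new_chunks)
```

Deleting by document ID first is what prevents the old and new versions of a page from competing at retrieval time.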
The most common production RAG incident: a user asks about your pricing and gets last year's rates. The model answers confidently with stale data because the old document has higher embedding similarity than the new one (which uses slightly different wording). Index freshness is not optional for anything time-sensitive.
The retrieval pipeline — advanced patterns
Query transformation
Raw user queries are often poor retrieval inputs. Users phrase things colloquially, with typos, with missing context. Three transformations that consistently improve retrieval:
- Multi-query expansion: generate 3 paraphrases of the query, retrieve for each, and deduplicate the results. It costs one extra LLM call; the recall lift (quoted here as 15–30%) varies by corpus, so measure it on your own eval set.
- HyDE (Hypothetical Document Embeddings): generate a hypothetical answer to the query, embed that instead of the query. The hypothetical answer's vocabulary matches document vocabulary better than a short question.
- Step-back prompting: rewrite a specific query to its more general form before retrieval. 'What's the max refund for order #A1234?' → 'What is the refund policy?' — much better retrieval.
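To make the first of these concrete, here is a hedged sketch of multi-query expansion. The `paraphrase` and `search` arguments are hypothetical callables you would supply: `paraphrase(query, n)` wraps an LLM call that returns `n` rewrites, and `search(query, top_k)` returns `(chunk_id, text, score)` tuples from your vector store. Keeping the best score per chunk is one reasonable deduplication rule, not the only one.

```python
def multi_query_retrieve(query, paraphrase, search, n_paraphrases=3, top_k=20):
    """Retrieve for the original query plus paraphrases, then deduplicate."""
    queries = [query] + list(paraphrase(query, n_paraphrases))
    best = {}  # chunk_id -> (score, text), keeping the highest score seen
    for q in queries:
        for chunk_id, text, score in search(q, top_k):
            if chunk_id not in best or score > best[chunk_id][0]:
                best[chunk_id] = (score, text)
    ranked = sorted(best.items(), key=lambda kv: kv[1][0], reverse=True)
    return [(cid, text, score) for cid, (score, text) in ranked[:top_k]]
```

HyDE and step-back prompting fit the same shape: swap `paraphrase` for a function that returns a hypothetical answer or a generalised question, and leave the retrieval loop unchanged.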
Reranking
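First-stage retrieval (vector search) optimises for speed, not precision. A cross-encoder reranker scores each query-chunk pair jointly, which is much more accurate but costs one forward pass per candidate, so latency grows linearly with the number of chunks you rerank. The production pattern: retrieve top-20 cheaply, rerank to top-5 accurately, send top-3 to the model. Cohere Rerank, BGE-reranker-large, and Jina Reranker are the common choices.

```python
import cohere

# With no arguments, the client is expected to read the API key from the
# environment (CO_API_KEY in recent SDK versions).
co = cohere.Client()


def rerank(query: str, chunks: list[str], top_n: int = 5) -> list[str]:
    """Return the top_n chunks, ordered by reranker relevance."""
    results = co.rerank(
        query=query,
        documents=chunks,
        top_n=top_n,
        model="rerank-english-v3.0",
    )
    # Each result carries the index of the chunk in the input list.
    return [chunks[r.index] for r in results.results]


# Usage: first retrieve 20, then rerank to 5.
# vector_search is a placeholder for your first-stage retriever.
initial_chunks = vector_search(query, top_k=20)
precise_chunks = rerank(query, initial_chunks, top_n=5)
```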
Context assembly
The context you inject into the LLM prompt is not just a paste of the top-K chunks. For production quality: place the most relevant chunks at the start and end of the context and the weaker ones in the middle (models tend to attend less to mid-context material, the "lost in the middle" effect), add source attribution metadata, compress chunks that exceed a token budget using LLMLingua or selective sentence removal, and always include a 'no relevant information found' fallback instruction.
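A minimal sketch of that assembly step follows. The names are hypothetical, chunks are assumed to be dicts with `text` and `source` keys already sorted by relevance, and word counts stand in for token counts. For simplicity, chunks that would overflow the budget are skipped instead of compressed; a compressor such as LLMLingua would slot in at that point.

```python
NO_CONTEXT = "No relevant information found."


def order_for_long_context(chunks_by_relevance):
    """Put the strongest chunks at the edges and the weakest in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


def build_context(chunks_by_relevance, token_budget=2000):
    kept, used = [], 0
    for chunk in chunks_by_relevance:
        cost = len(chunk["text"].split())  # assumption: words approximate tokens
        if used + cost > token_budget:
            continue  # skipped here; a compressor could shrink it instead
        kept.append(chunk)
        used += cost
    if not kept:
        return NO_CONTEXT
    ordered = order_for_long_context(kept)
    return "\n\n".join(
        f"[Source {i}] ({c['source']})\n{c['text']}"
        for i, c in enumerate(ordered, start=1)
    )
```

With five chunks ranked 1 to 5, `order_for_long_context` yields the order 1, 3, 5, 4, 2: the best chunk opens the context and the second best closes it.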
The generation pipeline
The LLM is the last component, but prompt design here matters more than model choice for most applications. Three things that dramatically improve generation quality:
- Citation instruction: 'When answering, cite the source by referring to its position: [Source 1], [Source 2]'. Forces the model to ground claims.
- Uncertainty instruction: 'If the provided context does not contain enough information to answer the question, say so explicitly. Do not extrapolate.' Reduces hallucinations on missing-context queries.
- Conflict instruction: 'If sources disagree, present both perspectives and note the disagreement.' Prevents silent resolution of conflicts.
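The three instructions above can be combined into a single prompt template. The sketch below is one possible layout, not a prescribed format: `build_prompt` and `GROUNDING_RULES` are hypothetical names, and `context` is assumed to be a string whose chunks are already labelled `[Source N]`.

```python
GROUNDING_RULES = "\n".join([
    "When answering, cite the source by referring to its position: "
    "[Source 1], [Source 2].",
    "If the provided context does not contain enough information to answer "
    "the question, say so explicitly. Do not extrapolate.",
    "If sources disagree, present both perspectives and note the disagreement.",
])


def build_prompt(question: str, context: str) -> str:
    """Assemble rules, retrieved context, and the user question."""
    return (
        f"{GROUNDING_RULES}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

Keeping the rules in one constant makes prompt changes easy to diff and to run through the eval suite.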
The operations layer
Observability — what to instrument
| Signal | Why it matters | How to capture |
|---|---|---|
| Retrieval precision | Are retrieved chunks relevant? | Log chunk scores + user feedback correlation |
| Context utilisation | Does the model actually use the context? | NLI model: does answer follow from context? |
| Answer faithfulness | Are answers grounded in retrieved docs? | RAGAS faithfulness on sampled queries |
| Latency breakdown | Where is time spent? | Trace spans: embed + search + rerank + LLM |
| Index freshness | How stale is the knowledge base? | Track source modified dates vs. chunk indexed dates |
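For the latency breakdown row, a minimal span timer can be sketched with the standard library alone. `Trace` is a hypothetical helper, not a real tracing API; in production you would more likely emit these spans through OpenTelemetry or your tracing vendor's SDK.

```python
import time
from contextlib import contextmanager


class Trace:
    """Accumulates wall-clock seconds per named pipeline stage."""

    def __init__(self):
        self.spans = {}  # stage name -> total seconds

    @contextmanager
    def span(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Recorded even if the stage raises, so failures stay visible.
            elapsed = time.perf_counter() - start
            self.spans[name] = self.spans.get(name, 0.0) + elapsed

    def total(self):
        return sum(self.spans.values())


# Usage: wrap each stage of a request in its own span.
trace = Trace()
with trace.span("embed"):
    pass  # embed the query here
with trace.span("search"):
    pass  # run vector search here
```

Logging `trace.spans` per request gives the embed, search, rerank, and LLM split directly, which is what you need to see where a P95 regression comes from.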
Evaluation pipeline
A RAG eval pipeline has three test sets: a golden set (hand-annotated query/answer pairs), a regression set (past failures — every incident adds an example), and a synthetic set (LLM-generated Q&A over your documents, cheap to create at scale). Run all three on every significant change to chunking, embedding model, retrieval config, or prompt.
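A simple harness for running the three test sets might look like the sketch below. It measures only retrieval hit rate, as a stand-in for whatever metrics you adopt (faithfulness scoring would need an LLM or NLI judge). The example fields `query` and `expected_doc_id`, the `retrieve(query, k)` callable returning doc IDs, and the per-set thresholds are all assumptions made for illustration.

```python
def hit_rate(examples, retrieve, k=5):
    """Fraction of examples whose expected doc ID appears in the top-k results."""
    if not examples:
        return 0.0
    hits = sum(
        1 for ex in examples
        if ex["expected_doc_id"] in retrieve(ex["query"], k)
    )
    return hits / len(examples)


def run_eval(test_sets, retrieve, thresholds, k=5):
    """Score each named test set and compare it against its threshold."""
    report = {}
    for name, examples in test_sets.items():
        score = hit_rate(examples, retrieve, k)
        report[name] = {
            "hit_rate": score,
            "passed": score >= thresholds.get(name, 0.0),
        }
    return report
```

A natural policy is to require a perfect score on the regression set, since every example there is a past incident, while allowing a lower bar on the synthetic set.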
Architecture sizing guide
| Scale | Vector DB | Embedding | Latency target | Approximate monthly infra cost |
|---|---|---|---|---|
| <100K chunks | Chroma (local) or Pinecone free | Any | No constraint | <$50 |
| 100K–5M chunks | Pinecone, Weaviate, Qdrant | text-embedding-3-small | <2s P95 | $200–800 |
| 5M–100M chunks | Weaviate / Qdrant (self-hosted) | text-embedding-3-large | <1s P95 | $1K–5K |
| >100M chunks | Distributed Qdrant / pgvector on Postgres | Custom fine-tuned | <500ms P95 | Custom |
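As a rough cross-check on these tiers, raw vector storage can be estimated from chunk count and embedding dimension. The sketch assumes float32 values (4 bytes each) and ignores index structures, metadata, and replication, all of which add to the real footprint; `raw_vector_gib` is a hypothetical helper.

```python
def raw_vector_gib(num_chunks: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw embedding storage in GiB, excluding index and metadata overhead."""
    return num_chunks * dims * bytes_per_value / 2**30


# 5M chunks at 3072 dims (text-embedding-3-large) as float32.
large = raw_vector_gib(5_000_000, 3072)

# The same corpus at 1536 dims needs half the raw storage.
small = raw_vector_gib(5_000_000, 1536)
```

Under these assumptions the 5M-chunk, 3072-dimension case comes to roughly 57 GiB of raw vectors, which is one reason larger tiers often pair with quantisation or smaller embedding dimensions.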