GenAI Systems Lab
AI Engineering

Designing a Production RAG System: Full Architecture Walkthrough

Document ingestion pipeline, retrieval layer, reranker, answer policy, eval loop, and monitoring — all the decisions you need to make before you ship.

Building a RAG demo takes a weekend. Building a RAG system that works in production — one that handles messy documents, ambiguous queries, evolving knowledge bases, and real users who break things — takes months of iteration. This is the full architecture walkthrough: every component, every decision, every failure mode.

The hardest part of RAG is not the vector search. It's everything around it: document pipelines, index freshness, eval, observability. The retrieval itself is almost the easy part.

System overview

A production RAG system has four major subsystems: the data pipeline (ingestion and indexing), the retrieval pipeline (query-time), the generation pipeline (LLM call), and the operations layer (observability, evals, freshness). Most tutorials only show the middle two.

Subsystem | Components | Failure cost
Data pipeline | Crawlers, parsers, chunkers, embedders, indexers | Silent: stale docs cause confident wrong answers
Retrieval pipeline | Query rewriter, vector search, reranker, context builder | Visible: user gets wrong or irrelevant answer
Generation pipeline | Prompt builder, LLM call, response formatter, citation extractor | Visible: model ignores context, hallucinates
Operations layer | Tracing, evals, freshness monitors, cost dashboards | Invisible until something breaks badly

The data pipeline — where most teams underinvest

Document ingestion

Documents arrive in every format: PDFs with tables, HTML pages with nav menus, Word docs with headers, Confluence pages with embedded images. Each format needs a dedicated parser. A bad parser corrupts the chunk before it ever reaches the model.

Chunking strategy selection

Strategy | Best for | Weakness
Fixed-size + overlap (512 tokens, 10%) | Quick start, homogeneous documents | Splits mid-sentence, mid-concept
Sentence-aware | Prose-heavy content (articles, manuals) | Sentences vary wildly in informativeness
Recursive character splitting | Mixed content; tries paragraph → sentence → word boundaries | Still arbitrary at boundaries
Semantic chunking | Best recall, topic-coherent chunks | Slow; needs embedding model at index time
Hierarchical (parent/child) | Long documents with clear sections | More complex index; two retrieval sizes
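To make the first row concrete, here is a minimal sketch of fixed-size chunking with overlap. It assumes the document has already been tokenized into a list of strings (whitespace splitting is enough to experiment; a real pipeline would use the embedding model's tokenizer). The function name `chunk_fixed` and its parameters are illustrative, not taken from any library.

```python
def chunk_fixed(
    tokens: list[str], size: int = 512, overlap_ratio: float = 0.10
) -> list[list[str]]:
    """Split a token list into fixed-size windows that overlap by a fraction of `size`."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    overlap = int(size * overlap_ratio)
    step = size - overlap  # how far the window advances each time

    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks
```

With the defaults, consecutive chunks share 51 tokens. The weakness listed in the table is visible in the code: window boundaries depend only on token counts, so nothing stops a split from landing mid-sentence.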

Add a metadata envelope to every chunk: document ID, source URL, last-modified date, section title, chunk position (index/total). Retrieval quality without metadata is retrieval blindfolded — you can't filter by date, source, or section.
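One way to represent that envelope is a small dataclass, with filtering applied to it before or after vector search. This is a sketch under assumptions: the field names mirror the list above, and `filter_chunks` is a hypothetical helper. Most vector databases offer their own metadata filters, which you would normally use instead of filtering in application code.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Chunk:
    """A chunk of text plus its metadata envelope."""
    text: str
    doc_id: str
    source_url: str
    last_modified: date
    section_title: str
    position: int  # zero-based index of this chunk within its document
    total: int     # total number of chunks in the document


def filter_chunks(
    chunks: list[Chunk],
    modified_since: Optional[date] = None,
    source_prefix: Optional[str] = None,
    section_title: Optional[str] = None,
) -> list[Chunk]:
    """Keep only the chunks that satisfy every filter that was provided."""
    kept = []
    for c in chunks:
        if modified_since is not None and c.last_modified < modified_since:
            continue
        if source_prefix is not None and not c.source_url.startswith(source_prefix):
            continue
        if section_title is not None and c.section_title != section_title:
            continue
        kept.append(c)
    return kept
```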

Embedding model choice

Use the MTEB leaderboard as a starting point, but benchmark on your own domain. Production choices in 2025: OpenAI text-embedding-3-large (3072 dims, a strong general-purpose default), Cohere embed-english-v3.0 (1024 dims, strong on long documents), nomic-embed-text (768 dims, runs locally, surprisingly competitive). For multilingual content: multilingual-e5-large or Cohere embed-multilingual-v3.0.
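Benchmarking on your domain can start as simply as measuring recall@k over a handful of labeled query/document pairs. The sketch below assumes you can wrap each candidate model in an `embed(text) -> vector` callable. `toy_embed` is a deliberately trivial stand-in so the example runs without any API, and every name here is illustrative rather than part of a real library.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall_at_k(
    embed: Callable[[str], Vector],
    corpus: dict[str, str],          # doc_id -> document text
    labeled: list[tuple[str, str]],  # (query, relevant doc_id)
    k: int = 5,
) -> float:
    """Fraction of queries whose relevant document appears in the top k results."""
    doc_vecs = {doc_id: embed(text) for doc_id, text in corpus.items()}
    hits = 0
    for query, relevant_id in labeled:
        q = embed(query)
        ranked = sorted(doc_vecs, key=lambda d: cosine(q, doc_vecs[d]), reverse=True)
        if relevant_id in ranked[:k]:
            hits += 1
    return hits / len(labeled) if labeled else 0.0


def toy_embed(text: str) -> list[float]:
    """Stand-in embedding (word counts over a tiny vocabulary). Replace with a real model."""
    vocab = ["refund", "pricing", "login"]
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]
```

Run `recall_at_k` once per candidate model on the same corpus and labeled set, then compare the scores. A brute-force scan like this is fine for a benchmark of a few thousand documents.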

Index freshness — the silent killer

Knowledge goes stale. A policy updated six months ago is still in your index as the canonical version unless you actively expire it. Every production RAG system needs an index freshness strategy:

The most common production RAG incident: a user asks about your pricing and gets last year's rates. The model answers confidently with stale data because the old document has higher embedding similarity than the new one (which uses slightly different wording). Index freshness is not optional for anything time-sensitive.
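A basic freshness check compares when each source document last changed with when it was last indexed. The sketch below assumes you record both timestamps per document ID. The function name, the reason strings, and the `max_age` policy are hypothetical choices for illustration, not a prescribed design.

```python
from datetime import datetime, timedelta
from typing import Optional


def find_stale_docs(
    indexed_at: dict[str, datetime],          # doc_id -> when its chunks were indexed
    source_modified_at: dict[str, datetime],  # doc_id -> source last-modified time
    now: datetime,
    max_age: Optional[timedelta] = None,
) -> dict[str, str]:
    """Map each stale document ID to the reason it needs attention."""
    stale = {}
    for doc_id, indexed in indexed_at.items():
        modified = source_modified_at.get(doc_id)
        if modified is None:
            # The source no longer exists, so its chunks should leave the index.
            stale[doc_id] = "source missing: expire from index"
        elif modified > indexed:
            stale[doc_id] = "source changed after indexing: re-index"
        elif max_age is not None and now - indexed > max_age:
            stale[doc_id] = "older than max_age: re-verify"
    return stale
```

Running a check like this on a schedule turns the silent failure into an alert: last year's pricing page shows up as either changed or missing, instead of surfacing in a user's answer.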

The retrieval pipeline — advanced patterns

Query transformation

Raw user queries are often poor retrieval inputs. Users phrase things colloquially, with typos, with missing context. Three transformations that consistently improve retrieval:

Reranking

First-stage retrieval (vector search) optimises for speed, not precision. A cross-encoder reranker scores each query-chunk pair together — much more accurate but O(n) forward passes. The production pattern: retrieve top-20 cheaply, rerank to top-5 accurately, send top-3 to the model. Cohere Rerank, BGE-reranker-large, and Jina Reranker are the common choices.

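import cohere

# Assumes the Cohere API key is available to the client (for example via the
# CO_API_KEY environment variable); otherwise pass api_key=... explicitly.
co = cohere.Client()

def rerank(query: str, chunks: list[str], top_n: int = 5) -> list[str]:
    """Return the top_n chunks, ordered by reranker relevance to the query."""
    results = co.rerank(
        query=query,
        documents=chunks,
        top_n=top_n,
        model="rerank-english-v3.0",
    )
    # Each result carries the index of its chunk in the original `chunks` list.
    return [chunks[r.index] for r in results.results]

# Usage: first retrieve 20, then rerank to 5.
# `vector_search` and `query` stand in for your first-stage retriever and the
# user's question; neither is defined in this snippet.
initial_chunks = vector_search(query, top_k=20)
precise_chunks = rerank(query, initial_chunks, top_n=5)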

Context assembly

The context you inject into the LLM prompt is not just a paste of the top-K chunks. For production quality: order chunks by relevance (most relevant first AND last — lost-in-middle mitigation), add source attribution metadata, compress chunks that exceed a token budget using LLMLingua or selective sentence removal, and always include a 'no relevant information found' fallback instruction.
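The ordering and fallback steps can be sketched in a few lines. This assumes chunks arrive sorted most relevant first, each paired with a source label. The names `order_for_long_context`, `build_context`, and the fallback wording are illustrative. Token-budget compression is left out, since it depends on the tool you choose.

```python
FALLBACK_INSTRUCTION = (
    "If the sources above do not contain the answer, "
    "reply: 'No relevant information found.'"
)


def order_for_long_context(items: list) -> list:
    """Place the strongest items at the start and end, the weakest in the middle.

    `items` must be sorted most relevant first. Even-ranked items fill the
    front in order; odd-ranked items fill the back in reverse.
    """
    front, back = [], []
    for rank, item in enumerate(items):
        (front if rank % 2 == 0 else back).append(item)
    return front + back[::-1]


def build_context(chunks: list[tuple[str, str]]) -> str:
    """Assemble prompt context from (source, text) pairs sorted by relevance."""
    ordered = order_for_long_context(chunks)
    parts = [f"[source: {source}]\n{text}" for source, text in ordered]
    parts.append(FALLBACK_INSTRUCTION)  # always present, even with no chunks
    return "\n\n".join(parts)
```

For five chunks ranked 1 to 5, the resulting order is 1, 3, 5, 4, 2: the best chunk opens the context, the second best closes it, and the weakest sits in the middle where the model is most likely to overlook it.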

The generation pipeline

The LLM is the last component, but prompt design here matters more than model choice for most applications. Three things that dramatically improve generation quality:

The operations layer

Observability — what to instrument

Signal | Why it matters | How to capture
Retrieval precision | Are retrieved chunks relevant? | Log chunk scores + user feedback correlation
Context utilisation | Does the model actually use the context? | NLI model: does answer follow from context?
Answer faithfulness | Are answers grounded in retrieved docs? | RAGAS faithfulness on sampled queries
Latency breakdown | Where is time spent? | Trace spans: embed + search + rerank + LLM
Index freshness | How stale is the knowledge base? | Track source modified dates vs. chunk indexed dates
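As an illustration of the latency breakdown row, here is a minimal span timer built only on the standard library. In production you would more likely use a tracing system such as OpenTelemetry; the `Trace` class below is a hypothetical stand-in that shows the shape of the data you want to collect.

```python
import time
from contextlib import contextmanager


class Trace:
    """Accumulates wall-clock time per named pipeline stage for one request."""

    def __init__(self) -> None:
        self.spans: dict[str, float] = {}

    @contextmanager
    def span(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the stage raised an exception.
            elapsed = time.perf_counter() - start
            self.spans[name] = self.spans.get(name, 0.0) + elapsed

    def breakdown(self) -> dict[str, float]:
        """Share of total recorded time spent in each stage (sums to 1.0)."""
        total = sum(self.spans.values())
        return {name: (d / total if total else 0.0) for name, d in self.spans.items()}
```

Wrap each stage in its own span (`with trace.span("embed"): ...`, then `"search"`, `"rerank"`, `"llm"`) and log `trace.spans` with the request ID. That is enough to tell whether a slow P95 comes from the reranker or the LLM call.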

Evaluation pipeline

A RAG eval pipeline has three test sets: a golden set (hand-annotated query/answer pairs), a regression set (past failures — every incident adds an example), and a synthetic set (LLM-generated Q&A over your documents, cheap to create at scale). Run all three on every significant change to chunking, embedding model, retrieval config, or prompt.
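A skeleton for running the three sets and gating a change on the results might look like the sketch below. The pass criterion, a case-insensitive substring match against an expected phrase, is a placeholder assumption; real pipelines usually score with an LLM judge or a metric such as RAGAS faithfulness. `run_eval` and `regressions` are illustrative names.

```python
from typing import Callable

EvalCase = tuple[str, str]  # (query, phrase the answer is expected to contain)


def run_eval(
    answer_fn: Callable[[str], str],
    test_sets: dict[str, list[EvalCase]],
) -> dict[str, float]:
    """Return the pass rate for each named test set (0.0 for an empty set)."""
    scores = {}
    for name, cases in test_sets.items():
        passed = sum(
            1 for query, expected in cases
            if expected.lower() in answer_fn(query).lower()
        )
        scores[name] = passed / len(cases) if cases else 0.0
    return scores


def regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.0,
) -> list[str]:
    """Names of test sets where the candidate scores below baseline minus tolerance."""
    return [
        name for name in baseline
        if candidate.get(name, 0.0) < baseline[name] - tolerance
    ]
```

Store the scores from the current production configuration as the baseline, run `run_eval` on the golden, regression, and synthetic sets for every candidate change, and block the rollout if `regressions` returns anything.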

Architecture sizing guide

Scale | Vector DB | Embedding | Latency target | Approximate monthly infra cost
<100K chunks | Chroma (local) or Pinecone free | Any | No constraint | <$50
100K–5M chunks | Pinecone, Weaviate, Qdrant | text-embedding-3-small | <2s P95 | $200–800
5M–100M chunks | Weaviate / Qdrant (self-hosted) | text-embedding-3-large | <1s P95 | $1K–5K
>100M chunks | Distributed Qdrant / pgvector on Postgres | Custom fine-tuned | <500ms P95 | Custom
