GenAI Systems Lab
RAG & Retrieval 10 min read

How RAG Actually Works — And Why It's Harder Than It Looks

The full retrieval-augmented generation pipeline: chunking → embedding → retrieval → reranking → generation. Where each step silently fails.

**No prerequisites — this is a great first post.** After this post you'll understand the full RAG pipeline: embed, store, retrieve, generate. More importantly, you'll know the three places it silently fails in production and why those failures are hard to catch.

RAG — Retrieval-Augmented Generation — solves a fundamental problem: LLMs know a lot, but they don't know your data.

A model trained in early 2024 doesn't know about your Q3 expense policy update, your internal engineering runbook, or your product's latest pricing. Fine-tuning to inject this knowledge is slow, expensive, and brittle. RAG is the practical alternative: retrieve the relevant information at query time and inject it into the prompt.
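
To make "inject it into the prompt" concrete, here is a minimal sketch of the injection step on its own. The function name `build_prompt`, the prompt wording, and the example policy text are all hypothetical; real systems vary the template considerably.

```python
def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Inject retrieved chunks into the prompt ahead of the user's question."""
    # Number each chunk so the model (and a human reviewer) can point at sources.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical chunk text, standing in for whatever the retriever returned.
prompt = build_prompt(
    "What is the daily meal allowance?",
    ["Hypothetical policy text: the daily meal allowance is ₹1,800."],
)
print(prompt)
```

Note that nothing in this step checks whether the chunks are relevant or current. The prompt looks equally well-formed either way.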

It sounds simple. It is not. Every step in the pipeline can fail in ways that are hard to detect — and the model will still answer confidently.

The full RAG pipeline

RAG has two distinct phases: indexing (offline, rerun whenever the source data changes) and querying (online, run on every user request).

Phase 1 — Indexing (offline)

Phase 2 — Querying (online)

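Step 5 is critical: you must use the same embedding model at ingest and at query time. If you switch embedding models, your entire index becomes invalid and every chunk has to be re-embedded, because vectors from different models live in different spaces and similarities between them are meaningless.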
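
One inexpensive guard is to record which embedding model built the index and refuse to search when the query was embedded with a different one. The sketch below assumes a toy in-memory index; `VectorIndex`, `check_query_vector`, and the model names are hypothetical, not the API of any particular vector database.

```python
from dataclasses import dataclass, field


@dataclass
class VectorIndex:
    """Toy in-memory index that remembers which embedding model built it."""
    embedding_model: str
    dimensions: int
    vectors: list[list[float]] = field(default_factory=list)


def check_query_vector(
    index: VectorIndex, model_name: str, query_vector: list[float]
) -> None:
    """Raise ValueError if the query vector cannot be compared with the index."""
    if model_name != index.embedding_model:
        raise ValueError(
            f"index built with {index.embedding_model!r}, "
            f"query embedded with {model_name!r}: re-embed the corpus first"
        )
    if len(query_vector) != index.dimensions:
        raise ValueError(
            f"expected {index.dimensions} dimensions, got {len(query_vector)}"
        )


index = VectorIndex(embedding_model="model-a", dimensions=3)
check_query_vector(index, "model-a", [0.1, 0.2, 0.3])  # passes silently

try:
    check_query_vector(index, "model-b", [0.1, 0.2, 0.3])
except ValueError as err:
    print(f"blocked: {err}")
```

The name check matters more than the dimension check: two different models can emit vectors of the same length, and a search across them would run without error and return nonsense.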

Chunking — where most teams get it wrong

Chunking is the process of splitting documents into pieces small enough to retrieve individually. The chunk is what gets embedded, stored, and returned.

Too small: a chunk containing "₹1,800" with no surrounding context is meaningless. The retriever returns it, but the model can't use it.

Too large: a 2,000-token chunk covering 4 different policy topics dilutes the embedding — the vector represents everything at once, and retrieval precision drops.

Common strategies:

Start with 512-token chunks and 10% overlap. Measure retrieval hit rate (how often the right chunk appears in the top 5). Adjust chunk size before tuning anything else, because it usually has the largest single impact on RAG quality.
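
As a rough sketch of that starting point, the function below splits a token list into fixed-size windows with overlap. It assumes the text has already been tokenised (a real pipeline would use the embedding model's tokenizer), and `chunk_tokens` is a hypothetical helper name.

```python
def chunk_tokens(
    tokens: list[str], chunk_size: int = 512, overlap_ratio: float = 0.10
) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share overlap_ratio of their length."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    overlap = int(chunk_size * overlap_ratio)  # 512 * 0.10 -> 51 tokens
    step = max(1, chunk_size - overlap)        # how far each window advances

    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the document.
        if start + chunk_size >= len(tokens):
            break
    return chunks


tokens = [f"tok{i}" for i in range(1000)]  # stand-in for a tokenised document
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])  # [512, 512, 78]
```

Fixed windows ignore sentence and section boundaries, so treat this as a baseline to measure against rather than a recommendation.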

Embedding — what the vectors represent

An embedding model converts text into a dense vector: a list of floating-point numbers, typically 768 to 4096 of them. Semantically similar text produces similar vectors. Cosine similarity (the dot product of two normalised vectors) measures how closely two vectors point in the same direction, which serves as a proxy for how close the two meanings are.
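
The parenthetical definition of cosine similarity can be written out directly. This is a plain-Python sketch for illustration; the two-dimensional vectors are made up, and production code would normally use a vectorised library instead.

```python
import math


def normalise(vector: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vector]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of the two vectors after normalising each to unit length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(x * y for x, y in zip(normalise(a), normalise(b)))


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Because each vector is normalised first, magnitude has no effect: only direction is compared.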

The embedding model is a separate model from your LLM. Popular choices: OpenAI's text-embedding-3-large (3072 dims), Cohere embed-v3, or open-source models like bge-large-en (1024 dims, free to run).

The model you choose affects retrieval quality significantly. Benchmark on your domain — general benchmarks (MTEB) are a starting point but don't always predict domain-specific performance.

Retrieval — top-K and its tradeoffs

After embedding the query, you run an approximate nearest-neighbour search over your vector index and return the K most similar chunks.
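
For intuition, here is the exact (brute-force) version of that search: score every chunk against the query and keep the best K. An approximate nearest-neighbour index aims to return nearly the same results without scanning everything. The chunk IDs and vectors are invented, and the vectors are assumed to be unit length so the dot product equals cosine similarity.

```python
import heapq


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_k(
    query_vector: list[float],
    index: list[tuple[str, list[float]]],
    k: int = 5,
) -> list[tuple[float, str]]:
    """Return the k highest-scoring (score, chunk_id) pairs, best first."""
    scored = ((dot(query_vector, vector), chunk_id) for chunk_id, vector in index)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Toy index of (chunk_id, unit-length vector) pairs.
index = [
    ("refund-policy", [1.0, 0.0]),
    ("pricing", [0.0, 1.0]),
    ("returns-faq", [0.8, 0.6]),
]

for score, chunk_id in top_k([1.0, 0.0], index, k=2):
    print(f"{chunk_id}: {score:.2f}")
```

The scan is linear in the number of chunks, which is fine for a few thousand vectors and is the reason larger corpora move to an approximate index.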

| top_k setting | Behaviour | Risk |
|---|---|---|
| 1 | Fastest, cheapest | Single point of failure: if the top result is wrong, the answer is wrong |
| 3–5 | Good balance | Standard default for most RAG systems |
| 10+ | High recall | Context window pressure, noise from low-relevance chunks |
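
One way to choose K from data rather than by feel is to measure, on a labelled set of queries, how often the known-correct chunk lands in the top K. The sketch below assumes you already have ranked chunk IDs per query and one gold chunk per query; the function name and the sample data are hypothetical.

```python
def hit_rate_at_k(
    ranked_results: dict[str, list[str]],
    gold_chunk: dict[str, str],
    k: int,
) -> float:
    """Fraction of labelled queries whose gold chunk appears in the top k results."""
    if not gold_chunk:
        raise ValueError("need at least one labelled query")
    hits = sum(
        1
        for query, gold in gold_chunk.items()
        if gold in ranked_results.get(query, [])[:k]
    )
    return hits / len(gold_chunk)


# Invented retrieval output: query -> chunk IDs, best first.
ranked = {
    "q1": ["a", "b", "c"],
    "q2": ["x", "y", "z"],
    "q3": ["m", "n", "o"],
    "q4": ["p", "q", "r"],
}
gold = {"q1": "a", "q2": "y", "q3": "o", "q4": "missing-chunk"}

for k in (1, 2, 3):
    print(k, hit_rate_at_k(ranked, gold, k))  # 0.25, then 0.5, then 0.75
```

If the hit rate stops improving past some K, the extra chunks are mostly adding noise and prompt tokens.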

Reranking — the quality upgrade

First-stage retrieval uses bi-encoders (embed query and chunks separately, compare). They're fast but imprecise — similarity in vector space doesn't always equal relevance.

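A cross-encoder reranker takes each query-chunk pair together, runs them through a smaller model, and produces a relevance score. It is much slower (one forward pass per candidate chunk at query time, instead of a single pass to embed the query), but considerably more precise, because the model reads the query and the chunk side by side.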

Common pattern: retrieve top-20 cheaply, rerank to top-3 precisely. The reranker adds 20–100ms latency but can lift answer accuracy by 15–30% on complex queries.
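
The retrieve-then-rerank pattern can be sketched with both stages passed in as functions. Everything here is a stand-in: `toy_first_stage` simply returns the first chunks of a tiny corpus, and `toy_pair_score` counts shared words where a real system would call a cross-encoder model.

```python
import re
from typing import Callable


def retrieve_then_rerank(
    query: str,
    first_stage: Callable[[str, int], list[str]],
    score_pair: Callable[[str, str], float],
    retrieve_k: int = 20,
    final_k: int = 3,
) -> list[str]:
    """Fetch retrieve_k candidates cheaply, then keep the final_k best by pair score."""
    candidates = first_stage(query, retrieve_k)
    # One scorer call per candidate: this is where the extra latency comes from.
    scored = [(score_pair(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:final_k]]


CORPUS = [
    "Refunds are issued within 14 days of purchase.",
    "Our office is closed on public holidays.",
    "To request a refund, contact support with your order number.",
    "Pricing is reviewed every quarter.",
]


def toy_first_stage(query: str, k: int) -> list[str]:
    """Stand-in for vector search: returns the first k chunks unranked."""
    return CORPUS[:k]


def toy_pair_score(query: str, chunk: str) -> float:
    """Stand-in for a cross-encoder: number of words the pair shares."""
    words = lambda text: set(re.findall(r"\w+", text.lower()))
    return float(len(words(query) & words(chunk)))


best = retrieve_then_rerank(
    "how do I request a refund", toy_first_stage, toy_pair_score, final_k=1
)
print(best)
```

Keeping the two stages as separate callables makes it straightforward to measure what the reranker contributes: run the same evaluation with and without it.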

Where RAG fails — silently

The model doesn't know when retrieval has failed. It answers with whatever context it was given, with the same confidence whether the context is correct or three years out of date.

A RAG system that passes eval on clean test queries will fail on edge cases in production. The failure modes above all produce confident, fluent, wrong answers. Build evaluation that tests these specifically — not just "did the model answer something reasonable?"
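
One way to test retrieval itself, rather than the fluency of the final answer, is to label each evaluation query with the document and version it should retrieve, then inspect what came back. This is a sketch under assumptions: the `doc_id` and `version_year` metadata fields, the class names, and the sample data are all hypothetical, and your chunks need comparable metadata for the check to work.

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    doc_id: str
    version_year: int
    text: str


@dataclass
class EvalCase:
    query: str
    expected_doc_id: str
    expected_version_year: int


def check_retrieval(case: EvalCase, chunks: list[RetrievedChunk]) -> list[str]:
    """Return a list of problems; an empty list means retrieval looks right."""
    problems = []
    matching = [c for c in chunks if c.doc_id == case.expected_doc_id]
    if not matching:
        problems.append("expected document not retrieved")
    elif all(c.version_year != case.expected_version_year for c in matching):
        years = sorted({c.version_year for c in matching})
        problems.append(
            f"wrong version retrieved: got {years}, "
            f"expected {case.expected_version_year}"
        )
    return problems


case = EvalCase("What is the current refund policy?", "refund-policy", 2022)
stale = [RetrievedChunk("refund-policy", 2019, "Hypothetical old policy text.")]
print(check_retrieval(case, stale))
```

A check like this fails loudly on a stale chunk even when the generated answer reads perfectly well.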

We shipped our RAG system and it tested well. Then a user asked about our 2022 refund policy and got the 2019 version — confident, fluent, wrong. That's when I understood that 'it works in demos' means nothing.
