How RAG Actually Works — And Why It's Harder Than It Looks
The full retrieval-augmented generation pipeline: chunking → embedding → retrieval → reranking → generation. Where each step silently fails.
**No prerequisites — this is a great first post.** After this post you'll understand the full RAG pipeline: chunk, embed, store, retrieve, rerank, generate. More importantly, you'll know the five ways it silently fails in production and why those failures are hard to catch.
RAG — Retrieval-Augmented Generation — solves a fundamental problem: LLMs know a lot, but they don't know your data.
A model trained in early 2024 doesn't know about your Q3 expense policy update, your internal engineering runbook, or your product's latest pricing. Fine-tuning to inject this knowledge is slow, expensive, and brittle. RAG is the practical alternative: retrieve the relevant information at query time and inject it into the prompt.
It sounds simple. It is not. Every step in the pipeline can fail in ways that are hard to detect — and the model will still answer confidently.
The full RAG pipeline
RAG has two distinct phases: indexing (offline, rerun whenever the data changes) and querying (online, run per user request).
Phase 1 — Indexing (offline)
- 1. Ingest documents — PDFs, wikis, databases, Notion pages, Slack channels
- 2. Chunk — split each document into overlapping fixed or semantic segments
- 3. Embed — convert each chunk into a dense vector using an embedding model
- 4. Store — write the vectors + chunk text + metadata into a vector database
Phase 2 — Querying (online)
- 5. Embed the query — same embedding model, same vector space
- 6. Retrieve — find the K nearest chunk vectors by cosine similarity
- 7. Rerank (optional) — re-score with a cross-encoder for precision
- 8. Augment — inject the top chunks into the LLM prompt as context
- 9. Generate — the LLM answers using only the provided context
Step 5 is critical: you must use the same embedding model at ingest and at query time. If you switch embedding models, your entire index becomes invalid — the vector spaces don't align.
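One way to make the same-model rule hard to violate is to record the embedding model's name alongside the index and check it on every query. The sketch below assumes a toy in-memory index; `VectorIndex` and `check_query_model` are hypothetical names, not part of any particular vector database.

```python
from dataclasses import dataclass, field


@dataclass
class VectorIndex:
    """Toy in-memory index that remembers which embedding model built it."""
    embedding_model: str
    dimensions: int
    entries: list = field(default_factory=list)

    def add(self, vector, text):
        # A dimension mismatch is the cheapest sign that the wrong model was used.
        if len(vector) != self.dimensions:
            raise ValueError("vector has the wrong dimensionality for this index")
        self.entries.append((vector, text))


def check_query_model(index, query_model):
    """Refuse to query an index that was built with a different embedding model."""
    if query_model != index.embedding_model:
        raise ValueError(
            f"index was built with {index.embedding_model!r}, "
            f"but the query uses {query_model!r}; re-embed the corpus first"
        )


index = VectorIndex(embedding_model="bge-large-en", dimensions=1024)
check_query_model(index, "bge-large-en")  # same model: passes silently

try:
    check_query_model(index, "text-embedding-3-large")
except ValueError as err:
    print(err)  # explains that the corpus must be re-embedded
```

The name check matters because two models can share a dimensionality and still produce incompatible vector spaces, so a dimension check alone would not catch the swap.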
Chunking — where most teams get it wrong
Chunking is the process of splitting documents into pieces small enough to retrieve individually. The chunk is what gets embedded, stored, and returned.
Too small: a chunk containing "₹1,800" with no surrounding context is meaningless. The retriever returns it, but the model can't use it.
Too large: a 2,000-token chunk covering 4 different policy topics dilutes the embedding — the vector represents everything at once, and retrieval precision drops.
Common strategies:
- Fixed-size — split every N tokens with M token overlap. Fast, consistent. Ignores document structure.
- Sentence-aware — split on sentence boundaries, group into N-sentence chunks. Better for prose.
- Semantic — use an embedding model to detect topic shifts and split there. Best precision, most compute.
- Hierarchical — store both sentence-level and paragraph-level chunks, retrieve at sentence level, return parent paragraph. Best of both.
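As a rough illustration of the first two strategies, here is a minimal sketch. It assumes the document is already tokenised into a list (placeholder tokens stand in for real tokenizer output) and uses a naive regex for sentence boundaries; `chunk_fixed` and `chunk_sentences` are illustrative names.

```python
import re


def chunk_fixed(tokens, chunk_size=512, overlap=51):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


def chunk_sentences(text, sentences_per_chunk=3):
    """Split on ., ! or ? followed by whitespace, then group N sentences per chunk."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]


tokens = [f"tok{i}" for i in range(1200)]  # placeholder tokens
fixed = chunk_fixed(tokens, chunk_size=512, overlap=51)
print([len(c) for c in fixed])  # [512, 512, 278]

prose = "One. Two! Three? Four. Five."
print(chunk_sentences(prose, sentences_per_chunk=2))
# ['One. Two!', 'Three? Four.', 'Five.']
```

The regex splitter will mis-split abbreviations such as "e.g." and is only meant to show the shape of the approach; a real pipeline would likely use a proper sentence segmenter.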
Start with 512-token chunks and 10% overlap. Measure the retrieval hit rate (how often the right chunk appears in the top 5). Adjust chunk size before tuning anything else, because it often has the largest single impact on RAG quality.
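The top-5 measurement takes only a few lines once you have a small labelled set of queries. This sketch assumes each query has exactly one known correct chunk id and that your retriever returns ranked ids; `hit_rate_at_k` and the ids are hypothetical.

```python
def hit_rate_at_k(ranked_ids_per_query, gold_id_per_query, k=5):
    """Fraction of queries whose correct chunk id appears in the top-k results."""
    if len(ranked_ids_per_query) != len(gold_id_per_query):
        raise ValueError("need one gold id per query")
    if not gold_id_per_query:
        raise ValueError("need at least one query")
    hits = sum(
        gold in ranked[:k]
        for ranked, gold in zip(ranked_ids_per_query, gold_id_per_query)
    )
    return hits / len(gold_id_per_query)


# Made-up retriever output for four test queries, best match first.
ranked = [
    ["c7", "c2", "c9"],
    ["c1", "c4", "c3"],
    ["c5", "c8", "c6"],
    ["c3", "c2", "c1"],
]
gold = ["c2", "c3", "c0", "c3"]

print(hit_rate_at_k(ranked, gold, k=1))  # 0.25
print(hit_rate_at_k(ranked, gold, k=3))  # 0.75
```

Re-running the same query set after each change to chunk size gives a like-for-like comparison, which is hard to get from eyeballing answers.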
Embedding — what the vectors represent
An embedding model converts text into a dense vector: a list of floating-point numbers, typically 768 to 4096 of them. Semantically similar text produces similar vectors. Cosine similarity (the dot product of two normalised vectors) measures how close two meanings are.
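To make the definition concrete, here is a small sketch of cosine similarity computed as the dot product of two normalised vectors. The three-dimensional vectors are made up for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math


def normalise(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    """Dot product of the two unit-length vectors."""
    return sum(x * y for x, y in zip(normalise(a), normalise(b)))


same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])

print(round(same_direction, 6))  # 1.0
print(round(orthogonal, 6))      # 0.0
print(round(opposite, 6))        # -1.0
```

Because only direction matters, a vector and a scaled copy of it score as identical. Many vector stores normalise at write time so that the query-time comparison reduces to a plain dot product.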
The embedding model is a separate model from your LLM. Popular choices: OpenAI's text-embedding-3-large (3072 dims), Cohere embed-v3, or open-source models like bge-large-en (1024 dims, free to run).
The model you choose affects retrieval quality significantly. Benchmark on your domain — general benchmarks (MTEB) are a starting point but don't always predict domain-specific performance.
Retrieval — top-K and its tradeoffs
After embedding the query, you run an approximate nearest-neighbour search over your vector index and return the K most similar chunks.
| top_k setting | Behaviour | Risk |
|---|---|---|
| 1 | Fastest, cheapest | Single point of failure — if the top result is wrong, the answer is wrong |
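| 3–5 | Good balance of recall and cost; a common default for RAG systems | Can still miss the answer when first-stage ranking is poor or the answer spans many chunks |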
| 10+ | High recall | Context window pressure, noise from low-relevance chunks |
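The effect of `top_k` is easy to see with an exact, brute-force search. Note that this is a stand-in: production systems use approximate nearest-neighbour indexes, and the two-dimensional unit vectors and chunk ids below are invented for the example.

```python
import heapq


def top_k(query_vec, index, k=5):
    """Return the k (score, chunk_id) pairs with the highest dot product, best first.

    `index` is a list of (chunk_id, unit_vector) pairs. Vectors are assumed to be
    normalised already, so the dot product equals cosine similarity.
    """
    scored = (
        (sum(q * x for q, x in zip(query_vec, vec)), chunk_id)
        for chunk_id, vec in index
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


index = [
    ("refund-policy", [1.0, 0.0]),
    ("pricing", [0.6, 0.8]),
    ("runbook", [0.0, 1.0]),
]
query = [0.8, 0.6]

print([chunk_id for _, chunk_id in top_k(query, index, k=1)])
# ['pricing']
print([chunk_id for _, chunk_id in top_k(query, index, k=2)])
# ['pricing', 'refund-policy']
```

With `k=1` the answer depends entirely on one ranking decision. Raising `k` brings in `refund-policy` as a fallback, at the cost of more tokens in the prompt.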
Reranking — the quality upgrade
First-stage retrieval uses bi-encoders (embed query and chunks separately, compare). They're fast but imprecise — similarity in vector space doesn't always equal relevance.
A cross-encoder reranker takes each query-chunk pair together, runs the pair through a small scoring model, and produces a relevance score. It is much slower (one forward pass per candidate chunk at query time, where a bi-encoder needs a single pass to embed the query), but usually far more precise.
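Common pattern: retrieve the top 20 cheaply, then rerank down to the top 3 precisely. The reranker typically adds 20–100 ms of latency, depending on model and hardware, but can lift answer accuracy by 15–30% on complex queries. Treat both figures as rough guides and measure on your own workload.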
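The retrieve-then-rerank pattern can be sketched as a function that scores each query-chunk pair jointly and keeps the best few. The scorer here, `overlap_score`, is only a placeholder (word overlap) standing in for a real cross-encoder; both function names and the candidate texts are assumptions for illustration.

```python
def overlap_score(query, chunk):
    """Placeholder for a cross-encoder: Jaccard overlap of lowercase words."""
    q_words = set(query.lower().split())
    c_words = set(chunk.lower().split())
    union = q_words | c_words
    return len(q_words & c_words) / len(union) if union else 0.0


def rerank(query, candidates, score_pair, top_n=3):
    """Score every (query, chunk) pair together and keep the top_n chunks."""
    scored = [(score_pair(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]


# Pretend these came back from first-stage retrieval, in vector-similarity order.
candidates = [
    "pricing for annual plans",
    "office seating plan",
    "refund policy",
    "refund policy for annual plans",
]

print(rerank("refund policy annual plans", candidates, overlap_score, top_n=2))
# ['refund policy for annual plans', 'refund policy']
```

Swapping in a real model means replacing `overlap_score` with a call that returns a relevance score for the pair; the surrounding loop is where the one-pass-per-candidate cost comes from.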
Where RAG fails — silently
The model doesn't know when retrieval has failed. It answers with whatever context it was given, with the same confidence whether the context is correct or three years out of date.
- Stale documents — an old policy version is retrieved because it has higher embedding similarity. The model answers with outdated information.
- Conflicting documents — two versions of the same policy are retrieved. The model resolves the conflict silently, without flagging it.
- Missing context — the right chunk is retrieved but lacks surrounding detail needed to answer. The model confabulates the missing parts.
- Ambiguous query — the user's question has two possible meanings. The retriever matches one of them, and if the user meant the other, the returned chunks are confidently wrong.
- Context overflow — too many chunks get passed to the model, some get truncated, and the most relevant one gets dropped.
A RAG system that passes eval on clean test queries will fail on edge cases in production. The failure modes above all produce confident, fluent, wrong answers. Build evaluation that tests these specifically — not just "did the model answer something reasonable?"
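Some of these failures can be guarded against before the prompt is built. The sketch below covers three of them: stale documents, conflicting versions, and context overflow. It assumes each chunk carries a `doc_id` and an `effective` date as metadata, which is a hypothetical schema, and it counts tokens by splitting on whitespace as a crude stand-in for a real tokenizer.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Chunk:
    doc_id: str      # hypothetical: identifies the logical document
    effective: date  # hypothetical: when this version took effect
    text: str
    score: float     # retrieval or reranker score, higher is better


def keep_latest_versions(chunks):
    """Keep only chunks from the newest version of each document.

    Also returns the doc_ids for which older versions were retrieved, so the
    caller can log or surface the conflict.
    """
    newest = {}
    for c in chunks:
        if c.doc_id not in newest or c.effective > newest[c.doc_id]:
            newest[c.doc_id] = c.effective
    kept = [c for c in chunks if c.effective == newest[c.doc_id]]
    conflicts = sorted({c.doc_id for c in chunks if c.effective != newest[c.doc_id]})
    return kept, conflicts


def fit_to_budget(chunks, max_tokens, count_tokens=lambda text: len(text.split())):
    """Add chunks best-score-first so the budget drops weak chunks, not strong ones."""
    selected, used = [], 0
    for c in sorted(chunks, key=lambda c: c.score, reverse=True):
        cost = count_tokens(c.text)
        if used + cost <= max_tokens:
            selected.append(c)
            used += cost
    return selected


retrieved = [
    Chunk("refund-policy", date(2019, 1, 1), "2019 refund terms (superseded)", 0.91),
    Chunk("refund-policy", date(2022, 1, 1), "2022 refund terms (current)", 0.88),
    Chunk("pricing", date(2023, 6, 1), "current pricing table", 0.70),
]

kept, conflicts = keep_latest_versions(retrieved)
print(conflicts)                                   # ['refund-policy']
print([c.effective.year for c in kept])            # [2022, 2023]
print([c.doc_id for c in fit_to_budget(kept, 5)])  # ['refund-policy']
```

Note that the older version scored higher (0.91 versus 0.88), which is exactly the stale-document case: similarity alone would have picked it. Missing context and ambiguous queries are not addressed here and need evaluation cases of their own.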
We shipped our RAG system and it tested well. Then a user asked about our 2022 refund policy and got the 2019 version — confident, fluent, wrong. That's when I understood that 'it works in demos' means nothing.