RAG Architectures: Naive, Advanced, Modular, and Agentic
How RAG has evolved from a simple retrieve-and-read loop to routing, query rewriting, self-RAG, corrective RAG, and full agentic retrieval.
RAG started simple: retrieve some chunks and paste them into the prompt. That naive approach works in demos but often fails in production. RAG has since evolved into a family of architectures, each fixing specific failure modes of the one before it.
[Video: IBM Technology — What is Retrieval-Augmented Generation (RAG)? (clear conceptual overview before the architecture deep-dive)]
Naive RAG
Index documents, embed the query, fetch top-K chunks, concatenate with the query, generate. This is the architecture in every tutorial.
- Breaks on: multi-hop questions, ambiguous queries, stale documents, keyword-heavy queries that semantic search misses
- Good for: simple Q&A over a single well-structured knowledge base
- When to use: prototyping, internal tools, low-stakes applications
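The five steps above can be sketched in a few lines. This is a minimal illustration, not a production implementation: `embed` is a toy bag-of-words stand-in for a real embedding model, and `llm` is a hypothetical callable that takes a prompt string and returns text. All function names here are assumptions made for the example.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(chunks):
    """Index step: store each chunk next to its vector."""
    return [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(index, query, k=2):
    """Embed the query and return the top-k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(query, chunks):
    """Concatenate the retrieved chunks with the user query."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

def naive_rag(index, query, llm, k=2):
    """Retrieve, build the prompt, generate. `llm` is a hypothetical callable."""
    return llm(build_prompt(query, retrieve(index, query, k)))

# Toy data, invented for the example.
CHUNKS = [
    "The refund window is 30 days from the delivery date.",
    "Support is available by email on weekdays.",
    "Shipping to Canada takes 5 to 7 business days.",
]
index = build_index(CHUNKS)
top = retrieve(index, "How long is the refund window?", k=1)
```

Passing `llm=lambda prompt: prompt` to `naive_rag` returns the assembled prompt unchanged, which is a convenient way to inspect what the model would actually see.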
Advanced RAG
Adds pre-retrieval and post-retrieval steps around the naive core. Pre-retrieval: query rewriting, HyDE (generating a hypothetical answer and using it as the query), query decomposition. Post-retrieval: reranking, context compression, citation grounding.
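As a rough illustration of HyDE, the sketch below retrieves with a drafted answer instead of the raw query. It assumes a toy Jaccard word-overlap `similarity` in place of embedding similarity, and `draft_answer` is a hypothetical stand-in for the LLM call that writes the hypothetical answer. The drafted answer may be factually wrong (here it says 30 days); it only needs to resemble the vocabulary of the real document.

```python
import re

def words(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def similarity(a, b):
    """Toy Jaccard word overlap, standing in for embedding similarity."""
    wa, wb = words(a), words(b)
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def retrieve(chunks, text, k=1):
    """Return the k chunks most similar to `text`."""
    return sorted(chunks, key=lambda c: similarity(text, c), reverse=True)[:k]

def hyde_retrieve(chunks, query, draft_answer, k=1):
    """HyDE: draft a hypothetical answer, then retrieve with it instead of the query."""
    hypothetical = draft_answer(query)
    return retrieve(chunks, hypothetical, k)

# Toy data, invented for the example.
CHUNKS = [
    "Rotate API keys every 90 days from the security settings page.",
    "How to contact support: open a ticket from the help menu.",
]
QUERY = "How often should I change my credentials?"

def fake_llm(query):
    # Hypothetical LLM output; a real system would call a model here.
    return "You should rotate your API keys every 30 days in the security settings."

plain = retrieve(CHUNKS, QUERY)
hyde = hyde_retrieve(CHUNKS, QUERY, fake_llm)
```

In this toy setup the raw query shares only the word "how" with the corpus and lands on the support chunk, while the drafted answer shares most of its vocabulary with the key-rotation chunk.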
Query rewriting alone can improve retrieval recall by 15–40% on many benchmarks. Instead of embedding only the raw user query, generate 3 paraphrases, retrieve for each of them, then deduplicate and rerank the merged results. This costs one extra LLM call plus the additional retrieval queries, and it tends to improve results.
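A minimal sketch of multi-query retrieval follows. The paraphrases are hard-coded here, whereas in practice they would come from the extra LLM call. `overlap_score` is a toy scoring function used as both retriever and reranker, so the names and data are assumptions for illustration only.

```python
import re

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_score(query, chunk):
    """Toy relevance score: fraction of query tokens that appear in the chunk."""
    q = tokens(query)
    return len(q & tokens(chunk)) / len(q) if q else 0.0

def retrieve(chunks, query, k=2):
    """Top-k chunks by score, dropping chunks with no overlap at all."""
    ranked = sorted(chunks, key=lambda c: overlap_score(query, c), reverse=True)
    return [c for c in ranked[:k] if overlap_score(query, c) > 0]

def multi_query_retrieve(chunks, query, paraphrases, k=2):
    """Retrieve for the query and each paraphrase, deduplicate, then rerank."""
    variants = [query, *paraphrases]
    candidates = []
    for variant in variants:
        for chunk in retrieve(chunks, variant, k):
            if chunk not in candidates:  # deduplicate across variants
                candidates.append(chunk)
    # Rerank the merged pool by the best score any variant gave the chunk.
    def best(chunk):
        return max(overlap_score(v, chunk) for v in variants)
    return sorted(candidates, key=best, reverse=True)[:k]

# Toy data, invented for the example.
CHUNKS = [
    "Employees accrue 20 days of paid vacation per year.",
    "The office closes at 6 pm on Fridays.",
    "Expense reports are due by the 5th of each month.",
]
QUERY = "How much PTO do I get?"
PARAPHRASES = [  # would come from an LLM in a real pipeline
    "How many paid vacation days do employees accrue per year?",
    "What is my annual leave allowance?",
    "vacation policy days per year",
]
single = retrieve(CHUNKS, QUERY)
merged = multi_query_retrieve(CHUNKS, QUERY, PARAPHRASES)
```

The raw query uses the abbreviation "PTO", which appears nowhere in the corpus, so single-query retrieval comes back empty; two of the three paraphrases reach the vacation chunk.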
Modular RAG
Treats the RAG pipeline as composable modules: query transformer, retriever, reranker, context compressor, generator, post-processor. Each module can be swapped independently. Production RAG systems at scale tend to be built this way, with routers deciding which modules to invoke based on query type.
| Module | What it does | Example implementations |
|---|---|---|
| Query transformer | Rewrite, decompose, or expand the query | HyDE, step-back prompting, multi-query |
| Retriever | Fetch candidate chunks | Dense (vector), sparse (BM25), hybrid |
| Reranker | Score and filter retrieved chunks | Cross-encoder, Cohere Rerank, LLM judge |
| Compressor | Reduce retrieved context to essentials | LLMLingua, selective compression |
| Generator | Produce the answer | GPT-4o, Claude, Llama with citations |
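One way to picture the modular design is a pipeline whose stages are plain callables, plus a router that picks a pipeline per query. The sketch below is an assumption-laden toy: the `Pipeline` class, the `ID_PATTERN` routing rule, and the stub retrievers and generator are all hypothetical, and the compressor and post-processor stages are omitted for brevity.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

# Toy corpus, invented for the example.
DOCS = [
    "ERR-4012 means the license key has expired.",
    "Restart the service after changing the configuration file.",
]
ID_PATTERN = re.compile(r"\b[A-Z]+-\d+\b")  # hypothetical "looks like an ID" rule

def words(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

@dataclass
class Pipeline:
    transform: Callable[[str], str]
    retrieve: Callable[[str], List[str]]
    rerank: Callable[[str, List[str]], List[str]]
    generate: Callable[[str, List[str]], str]

    def run(self, query: str) -> str:
        q = self.transform(query)
        return self.generate(query, self.rerank(q, self.retrieve(q)))

def identity(query: str) -> str:
    return query

def sparse_retrieve(query: str) -> List[str]:
    """Exact-match stand-in for a sparse (BM25-style) retriever."""
    ids = ID_PATTERN.findall(query)
    return [d for d in DOCS if any(i in d for i in ids)]

def dense_retrieve(query: str) -> List[str]:
    """Word-overlap stand-in for a dense (vector) retriever."""
    return [d for d in DOCS if len(words(query) & words(d)) >= 2]

def keep_order(query: str, chunks: List[str]) -> List[str]:
    return chunks

def by_overlap(query: str, chunks: List[str]) -> List[str]:
    return sorted(chunks, key=lambda c: len(words(query) & words(c)), reverse=True)

def echo_generate(query: str, chunks: List[str]) -> str:
    """Stub generator: returns the top chunk instead of calling an LLM."""
    return chunks[0] if chunks else "No relevant context found."

ID_LOOKUP = Pipeline(identity, sparse_retrieve, keep_order, echo_generate)
SEMANTIC = Pipeline(identity, dense_retrieve, by_overlap, echo_generate)

def route(query: str) -> Pipeline:
    """Router: ID-like queries go to the sparse pipeline, the rest to dense."""
    return ID_LOOKUP if ID_PATTERN.search(query) else SEMANTIC

def answer(query: str) -> str:
    return route(query).run(query)
```

Because every stage is just a callable with a fixed signature, swapping the reranker or retriever means constructing a new `Pipeline` with one argument changed, without touching the other stages.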
Agentic RAG
The retriever becomes a tool that an agent can call multiple times, in sequence, with different queries. The agent plans its retrieval strategy based on the query and intermediate results. This handles multi-hop questions naturally — retrieve fact A, observe it, formulate a new query for fact B, retrieve, combine.
Agentic RAG is powerful but adds latency (multiple retrieval rounds), cost (multiple LLM calls), and failure surface (the agent can loop or issue irrelevant retrievals). Reach for it when simple RAG demonstrably fails on multi-hop or complex questions, not as a default architecture.
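The retrieve, observe, re-query loop and its guard rails can be sketched as follows. This assumes a hypothetical `policy` callable standing in for the LLM that plans each step; here it is a hard-coded script so the example runs without a model. The knowledge base is invented toy data, and the step budget and repeated-evidence check are two simple (assumed) ways to limit looping.

```python
import re
from typing import Callable, List, Tuple

# Toy knowledge base, invented for the example.
KB = [
    "Widget Pro is manufactured by Acme Corp.",
    "The chief executive of Acme Corp is Dana Lee.",
    "Acme Corp was founded in 1999.",
]

def words(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def search(query: str) -> str:
    """Retriever tool: return the chunk sharing the most words with the query."""
    return max(KB, key=lambda c: len(words(query) & words(c)))

Action = Tuple[str, str]  # ("search", query) or ("answer", text)

def scripted_policy(question: str, notes: List[str]) -> Action:
    """Hypothetical stand-in for the LLM that decides the next step."""
    if not notes:
        return ("search", "Widget Pro manufactured by")
    if len(notes) == 1:
        # Use the first observation to formulate the second query.
        company = notes[0].rsplit("by ", 1)[1].rstrip(".")
        return ("search", f"chief executive of {company}")
    return ("answer", notes[-1])

def agentic_rag(question: str,
                policy: Callable[[str, List[str]], Action],
                max_steps: int = 4) -> Tuple[str, List[str]]:
    notes: List[str] = []
    for _ in range(max_steps):  # step budget bounds latency and cost
        kind, payload = policy(question, notes)
        if kind == "answer":
            return payload, notes
        observation = search(payload)
        if observation in notes:  # same evidence again: stop instead of looping
            break
        notes.append(observation)
    return "Unable to answer within the step budget.", notes

answer, trail = agentic_rag("Who runs the company that makes Widget Pro?", scripted_policy)
```

The second query cannot be written until the first result is observed, which is exactly what a single retrieve-and-read pass has no way to do.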
Self-RAG and corrective RAG
Self-RAG trains the model to emit special reflection tokens that decide whether to retrieve, whether the retrieved documents are relevant, and whether the final answer is grounded in them. Corrective RAG adds a retrieval evaluator that reroutes to web search when local retrieval quality falls below a threshold. Both treat retrieval as dynamic and conditional rather than always-on.
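The corrective routing idea can be illustrated with a small sketch. It covers only the evaluate-then-fall-back step: the `relevance` function is a toy word-overlap score standing in for a trained retrieval evaluator, `web_search` is a hypothetical stub, and the 0.5 threshold is arbitrary. Self-RAG is not shown because it depends on a model fine-tuned to emit reflection tokens.

```python
import re
from typing import Callable, List, Tuple

def words(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def relevance(query: str, chunk: str) -> float:
    """Toy evaluator: fraction of query words found in the chunk."""
    q = words(query)
    return len(q & words(chunk)) / len(q) if q else 0.0

def corrective_retrieve(query: str,
                        local_retrieve: Callable[[str], List[str]],
                        web_search: Callable[[str], List[str]],
                        threshold: float = 0.5) -> Tuple[str, List[str]]:
    """Keep local chunks that pass the evaluator; otherwise fall back to web search."""
    chunks = local_retrieve(query)
    kept = [c for c in chunks if relevance(query, c) >= threshold]
    if kept:
        return "local", kept
    return "web", web_search(query)

# Toy data and stubs, invented for the example.
LOCAL = ["Badge access requests go through the facilities portal."]

def local_retrieve(query: str) -> List[str]:
    return LOCAL  # a real retriever would rank against an index

def web_search(query: str) -> List[str]:
    return [f"(web result for: {query})"]  # hypothetical stub, no network call

hit = corrective_retrieve("badge access requests", local_retrieve, web_search)
miss = corrective_retrieve("current euro exchange rate", local_retrieve, web_search)
```

Returning the source label alongside the chunks makes it easy to log how often the fallback fires, which is a useful signal for tuning the threshold.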
Choosing your RAG architecture — a decision tree
The right architecture depends on your failure mode. Start with naive RAG. When you hit a wall, diagnose why and upgrade the specific component that's failing — not the whole pipeline.
| If this fails | Add this |
|---|---|
| Retrieval precision (wrong chunks returned) | Reranker (cross-encoder) |
| Retrieval recall (right chunk not found) | Query rewriting + multi-query retrieval |
| Keyword/ID queries miss | Hybrid search (vector + BM25) |
| Multi-hop questions fail | Agentic RAG with sequential retrieval |
| Context too sparse | Parent document retrieval / hierarchical chunks |
| Model ignores retrieved context | Contextual compression + citation prompting |
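For the hybrid search row, one common way to merge a vector ranking with a BM25 ranking is reciprocal rank fusion (RRF), which scores each document by the sum of 1 / (k + rank) across the rankings. RRF is an assumption here, since the text above does not name a fusion method, and the document IDs and the conventional k = 60 are illustrative.

```python
from collections import defaultdict
from typing import Dict, List

def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each document scores sum(1 / (k + rank)) over the lists."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

# Hypothetical outputs of a vector retriever and a BM25 retriever.
vector_ranking = ["doc_a", "doc_b", "doc_c"]
bm25_ranking = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([vector_ranking, bm25_ranking])
```

RRF uses only rank positions, so it sidesteps the problem that cosine similarities and BM25 scores live on different scales. Documents ranked well by both retrievers rise to the top, while a document found by only one retriever is still kept.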