
RAG Architectures: Naive, Advanced, Modular, and Agentic

How RAG has evolved from a simple retrieve-and-read loop to routing, query rewriting, self-RAG, corrective RAG, and full agentic retrieval.

RAG started simple: retrieve some chunks, paste them in the prompt. That naive approach works in demos. It fails in production. Over the past two years, RAG has evolved into a rich family of architectures — each fixing specific failure modes of the version before it.

[Video: IBM Technology — What is Retrieval-Augmented Generation (RAG)? (clear conceptual overview before the architecture deep-dive)]

Naive RAG

Index documents, embed the query, fetch top-K chunks, concatenate with the query, generate. This is the architecture in every tutorial.
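The five steps above can be sketched in a few lines. This is a minimal illustration, not a production implementation: `embed` is a toy bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector store, and the final generation call is left out, so the sketch stops at the assembled prompt. All function names here are hypothetical.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words term count."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Index step: embed every chunk once, up front."""
    return [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    """Embed the query and return the top-K most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(query: str, chunks: list[str]) -> str:
    """Concatenate retrieved chunks with the query; an LLM call would follow."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

chunks = [
    "BM25 is a sparse keyword ranking function.",
    "Cross-encoders rerank retrieved chunks.",
    "HyDE generates a hypothetical answer to use as the query.",
]
index = build_index(chunks)
top = retrieve(index, "what is BM25 ranking", k=2)
prompt = build_prompt("what is BM25 ranking", top)
```

Note that the second retrieved chunk has zero similarity to the query and is included anyway, simply because K is 2. Always returning K chunks regardless of relevance is one of the failure modes the later architectures address.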

Advanced RAG

Adds pre-retrieval and post-retrieval steps around the naive core. Pre-retrieval: query rewriting, HyDE (generating a hypothetical answer and using it as the query), query decomposition. Post-retrieval: reranking, context compression, citation grounding.

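Query rewriting alone is reported to improve retrieval recall by 15–40% on many benchmarks, though the gain depends on the corpus and the query mix. Instead of embedding the raw user query, generate 3 paraphrases, retrieve for all 3, then deduplicate and rerank the merged results. This costs one extra LLM call plus the additional retrieval calls, and it tends to improve results consistently.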
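A rough sketch of multi-query retrieval follows. The `rewrite` and `search` callables are assumptions standing in for an LLM paraphrasing call and a retriever. For the deduplicate-and-rerank step this sketch uses reciprocal rank fusion (RRF), which is one common choice and not necessarily what the text above has in mind; a cross-encoder reranker could be substituted.

```python
from collections import defaultdict
from typing import Callable

def multi_query_retrieve(
    query: str,
    rewrite: Callable[[str], list[str]],   # assumed: one LLM call returning paraphrases
    search: Callable[[str], list[str]],    # assumed: returns ranked chunk IDs
    k: int = 3,
    rrf_k: int = 60,                       # conventional RRF smoothing constant
) -> list[str]:
    """Retrieve for every query variant, then merge rankings with RRF."""
    scores: dict[str, float] = defaultdict(float)
    for variant in rewrite(query):
        for rank, chunk_id in enumerate(search(variant), start=1):
            # Keying by chunk ID deduplicates; repeated hits accumulate score.
            scores[chunk_id] += 1.0 / (rrf_k + rank)
    ranked = sorted(scores, key=lambda cid: scores[cid], reverse=True)
    return ranked[:k]

# Hypothetical stubs so the sketch runs without an LLM or a vector store.
def fake_rewrite(query: str) -> list[str]:
    return [query, f"{query} explained", f"how does {query} work"]

FAKE_RESULTS = {
    "hybrid search": ["c1", "c2"],
    "hybrid search explained": ["c2", "c3"],
    "how does hybrid search work": ["c2", "c1"],
}

def fake_search(query: str) -> list[str]:
    return FAKE_RESULTS.get(query, [])

results = multi_query_retrieve("hybrid search", fake_rewrite, fake_search)
```

Here `c2` ranks first because it appears in all three result lists, even though it is never the top hit for the original query alone.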

Modular RAG

Treats the RAG pipeline as composable modules: query transformer, retriever, reranker, context compressor, generator, post-processor. Each module can be swapped independently. This is the architecture of production RAG systems at scale — routers decide which modules to invoke based on query type.

| Module | What it does | Example implementations |
|---|---|---|
| Query transformer | Rewrite, decompose, or expand the query | HyDE, step-back prompting, multi-query |
| Retriever | Fetch candidate chunks | Dense (vector), sparse (BM25), hybrid |
| Reranker | Score and filter retrieved chunks | Cross-encoder, Cohere Rerank, LLM judge |
| Compressor | Reduce retrieved context to essentials | LLMLingua, selective compression |
| Generator | Produce the answer | GPT-4o, Claude, Llama with citations |
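One way to picture the modular design is as a list of interchangeable functions that share a state object, with a router choosing which ones run. The sketch below is illustrative only: the module bodies are toy placeholders, the routing rule (queries containing digits are treated as ID lookups) is an assumption, and none of the names come from a real framework.

```python
from dataclasses import dataclass, field
from typing import Callable

CORPUS = [
    "ERR-42 means the index is stale.",
    "Reranking reorders retrieved chunks by relevance.",
    "Hybrid search combines BM25 and vectors.",
]

@dataclass
class RagState:
    query: str
    chunks: list[str] = field(default_factory=list)
    answer: str = ""
    trace: list[str] = field(default_factory=list)  # which modules ran

Module = Callable[[RagState], RagState]

def query_transformer(state: RagState) -> RagState:
    state.query = state.query.strip().lower()  # placeholder for a rewrite step
    state.trace.append("transform")
    return state

def retriever(state: RagState) -> RagState:
    words = state.query.lower().split()
    state.chunks = [c for c in CORPUS if any(w in c.lower() for w in words)]
    state.trace.append("retrieve")
    return state

def reranker(state: RagState) -> RagState:
    words = state.query.lower().split()
    # Toy score: number of query words found in the chunk.
    score = lambda c: sum(w in c.lower() for w in words)
    state.chunks = sorted(state.chunks, key=score, reverse=True)[:2]
    state.trace.append("rerank")
    return state

def generator(state: RagState) -> RagState:
    state.answer = f"Answer grounded in {len(state.chunks)} chunk(s)."
    state.trace.append("generate")
    return state

def route(query: str) -> list[Module]:
    """Assumed routing rule: ID-like queries skip rewriting and reranking."""
    if any(ch.isdigit() for ch in query):
        return [retriever, generator]
    return [query_transformer, retriever, reranker, generator]

def run(query: str) -> RagState:
    state = RagState(query=query)
    for module in route(query):
        state = module(state)
    return state

id_run = run("ERR-42")
full_run = run("How does reranking work")
```

Because every module has the same signature, swapping the reranker or inserting a compressor means editing the list returned by `route`, not the modules around it.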

Agentic RAG

The retriever becomes a tool that an agent can call multiple times, in sequence, with different queries. The agent plans its retrieval strategy based on the query and intermediate results. This handles multi-hop questions naturally — retrieve fact A, observe it, formulate a new query for fact B, retrieve, combine.

Agentic RAG is powerful but adds latency (multiple retrieval rounds), cost (multiple LLM calls), and failure surface (the agent can loop or retrieve irrelevant content). Reach for it when simple RAG demonstrably fails on multi-hop or complex questions, not as a default architecture.
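The retrieve, observe, re-query loop can be sketched as follows, with two guards against the looping failure mode: a step budget and a check for repeated queries. The `plan_next_query` callable is an assumption standing in for the agent's LLM planning step, and the knowledge base and planner below are toy stubs.

```python
from typing import Callable, Optional

def agentic_retrieve(
    question: str,
    plan_next_query: Callable[[str, list[str]], Optional[str]],  # assumed LLM planner
    search: Callable[[str], list[str]],
    max_steps: int = 4,
) -> list[str]:
    """Call the retriever repeatedly until the planner is satisfied."""
    observations: list[str] = []
    seen_queries: set[str] = set()
    for _ in range(max_steps):                 # hard budget on retrieval rounds
        next_query = plan_next_query(question, observations)
        if next_query is None:                 # planner has enough evidence
            break
        if next_query in seen_queries:         # guard against looping
            break
        seen_queries.add(next_query)
        observations.extend(search(next_query))
    return observations

# Hypothetical two-hop example: find the maker, then the maker's founder.
KB = {
    "maker of Widget": "Widget is made by Acme.",
    "founder of Acme": "Acme was founded by Jo Lee.",
}

def toy_search(query: str) -> list[str]:
    return [KB[query]] if query in KB else []

def toy_planner(question: str, observations: list[str]) -> Optional[str]:
    if not observations:
        return "maker of Widget"
    if len(observations) == 1 and "Acme" in observations[0]:
        return "founder of Acme"               # second hop depends on the first
    return None

facts = agentic_retrieve("Who founded the company that makes Widget?", toy_planner, toy_search)
stuck = agentic_retrieve("q", lambda q, obs: "maker of Widget", toy_search)
```

The `stuck` call shows the guard in action: a planner that keeps asking the same query is cut off after one round instead of spending the whole step budget.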

Self-RAG and corrective RAG

Self-RAG trains the model to emit special tokens deciding whether to retrieve, whether retrieved docs are relevant, and whether the final answer is grounded. Corrective RAG adds a retrieval evaluator that reroutes to web search if local retrieval quality is below threshold. Both treat retrieval as dynamic and conditional, not always-on.
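The corrective RAG idea of a threshold-gated fallback can be sketched like this. It is a simplification under stated assumptions: `evaluate` stands in for a trained retrieval evaluator (here a toy word-overlap score), `web_search` is a stub, and the 0.5 threshold is arbitrary. Self-RAG is not shown, since it relies on a model trained to emit its own control tokens.

```python
from typing import Callable

def corrective_retrieve(
    query: str,
    local_search: Callable[[str], list[str]],
    evaluate: Callable[[str, str], float],     # assumed retrieval evaluator, 0.0 to 1.0
    web_search: Callable[[str], list[str]],    # assumed fallback source
    threshold: float = 0.5,                    # arbitrary value for illustration
) -> tuple[list[str], str]:
    """Use local chunks if any scores above threshold, else fall back to web."""
    chunks = local_search(query)
    kept = [c for c in chunks if evaluate(query, c) >= threshold]
    if kept:
        return kept, "local"
    return web_search(query), "web"

# Hypothetical stubs so the sketch runs offline.
LOCAL_DOCS = ["Refunds are processed within five days."]

def toy_local_search(query: str) -> list[str]:
    return LOCAL_DOCS

def overlap_score(query: str, chunk: str) -> float:
    """Fraction of query words that appear in the chunk."""
    q = set(query.lower().split())
    c = set(chunk.lower().rstrip(".").split())
    return len(q & c) / len(q) if q else 0.0

def toy_web_search(query: str) -> list[str]:
    return [f"(web result for '{query}')"]

local_hit = corrective_retrieve("refunds processed", toy_local_search, overlap_score, toy_web_search)
web_hit = corrective_retrieve("current exchange rate", toy_local_search, overlap_score, toy_web_search)
```

Returning the source label alongside the chunks makes it easy to log how often the fallback fires, which is a useful signal about gaps in the local corpus.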

Choosing your RAG architecture — a decision tree

The right architecture depends on your failure mode. Start with naive RAG. When you hit a wall, diagnose why and upgrade the specific component that's failing — not the whole pipeline.

| If this fails | Add this |
|---|---|
| Retrieval precision (wrong chunks returned) | Reranker (cross-encoder) |
| Retrieval recall (right chunk not found) | Query rewriting + multi-query retrieval |
| Keyword/ID queries miss | Hybrid search (vector + BM25) |
| Multi-hop questions fail | Agentic RAG with sequential retrieval |
| Context too sparse | Parent document retrieval / hierarchical chunks |
| Model ignores retrieved context | Contextual compression + citation prompting |
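To tell the first two rows apart, it helps to measure precision and recall separately on a small labeled set. The sketch below assumes you have, for each test query, the retrieved chunk IDs in rank order and a hand-labeled set of relevant IDs; the IDs shown are made up.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the top-K retrieved chunks that are actually relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for chunk_id in top if chunk_id in relevant) / len(top)

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of all relevant chunks that appear in the top-K."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

# Hypothetical labeled example for one query.
retrieved = ["c1", "c7", "c3", "c9"]
relevant = {"c1", "c2", "c3"}

precision = precision_at_k(retrieved, relevant, k=4)
recall = recall_at_k(retrieved, relevant, k=4)
```

Low precision with acceptable recall points toward a reranker; low recall points toward query rewriting or hybrid search, since a reranker cannot surface a chunk that was never retrieved.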

