
RAG Architectures: Naive, Advanced, Modular, and Agentic

How RAG has evolved from a simple retrieve-and-read loop to routing, query rewriting, self-RAG, corrective RAG, and full agentic retrieval.

RAG started simple: retrieve some chunks and paste them into the prompt. That naive approach works in demos. It fails in production. Over the past two years, RAG has grown into a family of architectures, each fixing specific failure modes of the version before it.

[Video: IBM Technology — What is Retrieval-Augmented Generation (RAG)? (clear conceptual overview before the architecture deep-dive)]

Naive RAG

Index documents, embed the query, fetch top-K chunks, concatenate with the query, generate. This is the architecture in every tutorial.

Breaks on: multi-hop questions, ambiguous queries, stale documents, keyword-heavy queries that semantic search misses
Good for: simple Q&A over a single well-structured knowledge base
When to use: prototyping, internal tools, low-stakes applications
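
The retrieve-and-read loop above can be sketched in a few lines. This is a minimal illustration, not a production implementation: `embed` is a toy bag-of-words stand-in for a real embedding model, `CHUNKS` is made-up sample data, and the generation step is left as a prompt string that you would pass to whichever LLM you use.

```python
import math
from collections import Counter
from typing import List

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, chunks: List[str], k: int = 2) -> List[str]:
    """Embed the query and return the top-K most similar chunks."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, context: List[str]) -> str:
    """Concatenate retrieved chunks with the query for the generator."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only the context below.\n\nContext:\n{joined}\n\nQuestion: {query}"

# Hypothetical sample knowledge base.
CHUNKS = [
    "The refund window is 30 days from delivery.",
    "Shipping takes 3 to 5 business days.",
    "Support is available by email on weekdays.",
]

question = "how long is the refund window"
top = retrieve(question, CHUNKS, k=1)
prompt = build_prompt(question, top)
```

Nothing here checks whether the retrieved chunk is actually relevant or current, which is where the failure modes listed above come from.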

Advanced RAG

Adds pre-retrieval and post-retrieval steps around the naive core. Pre-retrieval: query rewriting, HyDE (generating a hypothetical answer and using it as the query), query decomposition. Post-retrieval: reranking, context compression, citation grounding.

Query rewriting alone is often reported to improve retrieval recall by 15–40%, though the gain varies with the benchmark and corpus. Instead of embedding only the raw user query, generate 3 paraphrases, retrieve for all 3, then deduplicate and rerank. This costs one extra LLM call plus a few extra retrievals, and it usually improves results.
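
A rough sketch of multi-query retrieval follows, under loud assumptions: `paraphrase` is a hypothetical stand-in for the single LLM call (it returns canned strings here), `score` is a toy word-overlap relevance function, and "rerank" is simplified to sorting by each chunk's best score across paraphrases. A real system would typically use an embedding retriever and a cross-encoder instead.

```python
from typing import Dict, List

def paraphrase(query: str) -> List[str]:
    """Hypothetical stand-in for one LLM call that returns 3 paraphrases."""
    canned = {
        "reset password": [
            "how do I reset my password",
            "forgot login credentials",
            "change account password",
        ]
    }
    return canned.get(query, [query])

def score(query: str, chunk: str) -> float:
    """Toy relevance score: fraction of query words present in the chunk."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

def multi_query_retrieve(query: str, chunks: List[str], k: int = 2) -> List[str]:
    """Retrieve for every paraphrase, deduplicate, then rank by best score."""
    best: Dict[str, float] = {}  # chunk -> best score; dict keys deduplicate
    for variant in paraphrase(query):
        for chunk in chunks:
            s = score(variant, chunk)
            if s > 0:
                best[chunk] = max(best.get(chunk, 0.0), s)
    return sorted(best, key=best.get, reverse=True)[:k]

# Hypothetical sample chunks.
CHUNKS = [
    "To change your password open account settings",
    "Forgot your login credentials use the recovery link",
    "Invoices are emailed monthly",
]

results = multi_query_retrieve("reset password", CHUNKS)
```

In this toy data the raw query "reset password" shares no words with the recovery-link chunk, so a single-query retriever would miss it; the paraphrase "forgot login credentials" recovers it.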

Modular RAG

Treats the RAG pipeline as composable modules: query transformer, retriever, reranker, context compressor, generator, post-processor. Each module can be swapped independently. This is the architecture of production RAG systems at scale, where routers decide which modules to invoke based on query type.

Module	What it does	Example implementations
Query transformer	Rewrite, decompose, or expand the query	HyDE, step-back prompting, multi-query
Retriever	Fetch candidate chunks	Dense (vector), sparse (BM25), hybrid
Reranker	Score and filter retrieved chunks	Cross-encoder, Cohere Rerank, LLM judge
Compressor	Reduce retrieved context to essentials	LLMLingua, selective compression
Generator	Produce the answer	GPT-4o, Claude, Llama with citations
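
One way to picture the modular idea is a pipeline whose stages are plain callables, plus a router that picks the retriever per query. This is a sketch under assumptions: the `Pipeline` class, the `route` heuristic (queries containing a digit are treated as ID lookups), and all the toy modules are invented for illustration and do not correspond to any particular framework. The compressor and post-processor stages are omitted to keep it short.

```python
from dataclasses import dataclass, replace
from typing import Callable, List

@dataclass(frozen=True)
class Pipeline:
    """Each stage is a swappable callable."""
    transform: Callable[[str], str]
    retrieve: Callable[[str], List[str]]
    rerank: Callable[[str, List[str]], List[str]]
    generate: Callable[[str, List[str]], str]

    def run(self, query: str) -> str:
        q = self.transform(query)
        chunks = self.rerank(q, self.retrieve(q))
        return self.generate(q, chunks)

# Hypothetical sample documents.
DOCS = ["Error E42 means the disk is full", "Disks fill up when logs are not rotated"]

def keyword_retriever(q: str) -> List[str]:
    """Sparse stand-in: exact token match."""
    return [d for d in DOCS if any(w in d.split() for w in q.split())]

def dense_retriever(q: str) -> List[str]:
    """Dense stand-in: pretends every document is semantically close."""
    return list(DOCS)

def identity(q: str) -> str:
    return q

def no_rerank(q: str, chunks: List[str]) -> List[str]:
    return chunks

def echo_generator(q: str, chunks: List[str]) -> str:
    """Generator stand-in: reports how many chunks it was given."""
    return f"{q} -> {len(chunks)} chunk(s)"

def route(query: str) -> Pipeline:
    """Toy router: queries containing digits go to keyword search."""
    has_id = any(ch.isdigit() for ch in query)
    retriever = keyword_retriever if has_id else dense_retriever
    return Pipeline(identity, retriever, no_rerank, echo_generator)

id_answer = route("E42").run("E42")
open_answer = route("why do disks fill up").run("why do disks fill up")

# Swap a single module without touching the rest of the pipeline.
top1 = replace(route("why do disks fill up"), rerank=lambda q, c: c[:1])
swapped_answer = top1.run("why do disks fill up")
```

Because every stage has a narrow signature, replacing the reranker or retriever is a one-line change, which is the property the table above relies on.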

Agentic RAG

The retriever becomes a tool that an agent can call multiple times, in sequence, with different queries. The agent plans its retrieval strategy based on the query and intermediate results. This handles multi-hop questions naturally: retrieve fact A, observe it, formulate a new query for fact B, retrieve again, and combine the two.

Agentic RAG is powerful but adds latency (multiple retrieval rounds), cost (multiple LLM calls), and failure surface (the agent can loop or retrieve irrelevant content). Reach for it when simple RAG demonstrably fails on multi-hop or complex questions, not as a default architecture.
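
The sequential retrieve-observe-requery loop might look roughly like the sketch below. Everything in it is an assumption made for illustration: `KB` is a made-up two-fact knowledge base, `search` is an exact-match lookup standing in for a real retriever, and `plan` is a hard-coded stand-in for the LLM that decides the next query. The `max_steps` cap is one simple guard against the looping failure mentioned above.

```python
from typing import Callable, List, Optional

# Hypothetical knowledge base keyed by query string.
KB = {
    "maker of widget": "Widget is made by Acme.",
    "ceo of acme": "The CEO of Acme is Dana.",
}

def search(query: str) -> str:
    """Retrieval tool: exact-match lookup standing in for a real retriever."""
    return KB.get(query.lower(), "no result")

def plan(question: str, observations: List[str]) -> Optional[str]:
    """Hypothetical stand-in for the LLM planner.

    Returns the next query to run, or None when enough has been gathered.
    """
    if not observations:
        return "maker of widget"  # hop 1: find fact A
    if len(observations) == 1 and "Acme" in observations[0]:
        return "ceo of acme"      # hop 2: query built from fact A
    return None

Planner = Callable[[str, List[str]], Optional[str]]

def agentic_rag(question: str, planner: Planner = plan, max_steps: int = 4) -> List[str]:
    """Call the retrieval tool repeatedly until the planner stops or the cap is hit."""
    observations: List[str] = []
    for _ in range(max_steps):  # hard cap guards against retrieval loops
        query = planner(question, observations)
        if query is None:
            break
        observations.append(search(query))
    return observations

facts = agentic_rag("Who is the CEO of the company that makes Widget?")
```

The collected observations would then be handed to the generator. Note that each loop iteration stands for one planner call plus one retrieval, which is where the extra latency and cost come from.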

Self-RAG and corrective RAG

Self-RAG trains the model to emit special reflection tokens that decide whether to retrieve, whether the retrieved documents are relevant, and whether the final answer is grounded in them. Corrective RAG adds a retrieval evaluator that falls back to web search when local retrieval quality is below a threshold. Both treat retrieval as dynamic and conditional, not always-on.
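
The corrective routing described above reduces to a threshold check on an evaluator score. The sketch below illustrates only that routing idea and is not the published Corrective RAG method: `overlap` is a toy word-overlap evaluator (a real one would be a trained model or LLM judge), the `0.5` threshold is arbitrary, and `local_search` and `web_search` are hypothetical stubs.

```python
from typing import Callable, List, Tuple

def overlap(query: str, chunk: str) -> float:
    """Toy retrieval evaluator: fraction of query words found in the chunk."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

def corrective_retrieve(
    query: str,
    local_search: Callable[[str], List[str]],
    web_search: Callable[[str], List[str]],
    evaluate: Callable[[str, str], float] = overlap,
    threshold: float = 0.5,  # arbitrary value chosen for this sketch
) -> Tuple[str, List[str]]:
    """Use local chunks if the best one scores well enough, else fall back to web search."""
    chunks = local_search(query)
    quality = max((evaluate(query, c) for c in chunks), default=0.0)
    if quality >= threshold:
        return "local", chunks
    return "web", web_search(query)

# Hypothetical stubs for the two retrieval backends.
LOCAL = ["The vacation policy grants 20 days"]
local_search = lambda q: LOCAL
web_search = lambda q: [f"web result for: {q}"]

covered = corrective_retrieve("vacation policy", local_search, web_search)
uncovered = corrective_retrieve("stock price today", local_search, web_search)
```

The same shape works for skipping retrieval entirely: replace the web fallback with a branch that answers from the model alone.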

Choosing your RAG architecture: a decision table

The right architecture depends on your failure mode. Start with naive RAG. When you hit a wall, diagnose why and upgrade the specific component that's failing, not the whole pipeline.

If this fails	Add this
Retrieval precision (wrong chunks returned)	Reranker (cross-encoder)
Retrieval recall (right chunk not found)	Query rewriting + multi-query retrieval
Keyword/ID queries miss	Hybrid search (vector + BM25)
Multi-hop questions fail	Agentic RAG with sequential retrieval
Context too sparse	Parent document retrieval / hierarchical chunks
Model ignores retrieved context	Contextual compression + citation prompting
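
As an example of upgrading a single component, the hybrid search row can be implemented by fusing two ranked lists. The sketch below uses reciprocal rank fusion, which is one common fusion method and an assumption on my part; the table does not say which method to use. The document IDs and the two input rankings are made up, and `k=60` is a conventional default rather than a tuned value.

```python
from typing import Dict, List

def rrf(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per document."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical outputs of a vector retriever and a BM25 retriever.
vector_ranking = ["doc_a", "doc_b", "doc_c"]
bm25_ranking = ["doc_c", "doc_a", "doc_d"]

fused = rrf([vector_ranking, bm25_ranking])
```

Documents that appear in both lists (`doc_a`, `doc_c`) rise above those found by only one retriever, and because only ranks are used, the two retrievers' raw scores never need to be put on the same scale.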
