
The Original RAG Paper: Retrieval-Augmented Generation Explained

Facebook AI's 2020 paper that named and formalised the RAG pattern. What Lewis et al. actually proposed, how it differs from modern RAG stacks, and why the core insight still holds.

By 2020, fine-tuning had a well-known problem: knowledge went stale. A model fine-tuned on your documentation in January knew nothing about February. Updating it meant another expensive training run. And the model had no mechanism to cite where answers came from.

In May 2020, Patrick Lewis and colleagues at Facebook AI Research published 'Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks'. The proposal: retrieve relevant documents at inference time and condition generation on them. This paper named and formalised the RAG pattern that most production retrieval systems still follow today.

What the paper actually proposed

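Two components: a retriever (Dense Passage Retrieval, a bi-encoder that embeds questions and documents into a shared vector space) and a generator (BART, a pretrained seq2seq model). During inference: encode the question, retrieve the top-k passages by inner-product similarity, concatenate each passage with the question, and generate an answer conditioned on that context. The paper's two variants, RAG-Sequence and RAG-Token, marginalise the generator's output over the k retrieved passages rather than packing them all into a single prompt. The query encoder and generator are fine-tuned jointly, while the passage index stays fixed.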
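The retrieve-then-generate flow can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `embed` is a toy bag-of-words stand-in for the DPR bi-encoder, the `//` separator and the helper names are assumptions made for this example, and the generator call is left out entirely.

```python
import math
import string


def tokenize(text: str) -> list[str]:
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (t.strip(string.punctuation) for t in text.lower().split())
    return [t for t in tokens if t]


def build_vocab(passages: list[str]) -> dict[str, int]:
    """Assign one vector dimension to each distinct token in the corpus."""
    vocab: dict[str, int] = {}
    for passage in passages:
        for token in tokenize(passage):
            vocab.setdefault(token, len(vocab))
    return vocab


def embed(text: str, vocab: dict[str, int]) -> list[float]:
    """Toy stand-in for a bi-encoder: L2-normalised bag-of-words vector."""
    vec = [0.0] * len(vocab)
    for token in tokenize(text):
        idx = vocab.get(token)
        if idx is not None:
            vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def retrieve(question: str, passages: list[str], vocab: dict[str, int], k: int = 2):
    """Score every passage by inner product with the question; keep the top k."""
    q = embed(question, vocab)
    scored = []
    for passage in passages:
        d = embed(passage, vocab)
        scored.append((sum(a * b for a, b in zip(q, d)), passage))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_generator_inputs(question: str, retrieved) -> list[str]:
    """Pair each retrieved passage with the question, one input per passage."""
    return [f"{passage} // {question}" for _, passage in retrieved]


passages = [
    "The Eiffel Tower is in Paris.",
    "Hamlet is a tragedy written by William Shakespeare.",
    "Photosynthesis converts light into chemical energy.",
]
vocab = build_vocab(passages)
top_k = retrieve("Who wrote Hamlet?", passages, vocab, k=2)
generator_inputs = build_generator_inputs("Who wrote Hamlet?", top_k)
```

Each string in `generator_inputs` would be passed to the seq2seq generator. A real system would swap `embed` for a trained encoder and the linear scan in `retrieve` for an approximate nearest-neighbour index.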

What's different in modern RAG stacks

Aspect	Lewis et al. 2020	Modern Production RAG
Retriever	Dense Passage Retrieval	Embedding APIs (OpenAI, Cohere, Voyage)
Index	FAISS on Wikipedia	Pinecone, Weaviate, Qdrant, pgvector
Chunking	Fixed 100-word passages	Semantic / hierarchical with overlap
Reranking	Not included	Cross-encoder rerankers (Cohere, BGE)
Generator	BART seq2seq	GPT-4, Claude, Gemini
Hybrid search	Dense only	Dense + BM25 with RRF fusion
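The hybrid-search row mentions RRF (Reciprocal Rank Fusion), which merges ranked lists using only rank positions, so dense similarity scores and BM25 scores never need to share a scale. A minimal sketch follows, assuming each retriever returns document IDs in rank order. The function name and document IDs are hypothetical, and `k=60` is the commonly used default constant rather than anything taken from the Lewis paper.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists: each document scores sum(1 / (k + rank)) over the lists."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


dense_hits = ["doc_a", "doc_b", "doc_c"]  # hypothetical dense retriever output
bm25_hits = ["doc_b", "doc_c", "doc_d"]   # hypothetical BM25 output
fused = rrf_fuse([dense_hits, bm25_hits])
print([doc_id for doc_id, _ in fused])  # ['doc_b', 'doc_c', 'doc_a', 'doc_d']
```

Documents that both retrievers rank well rise to the top, which is why `doc_b` and `doc_c` overtake `doc_a` even though `doc_a` was the dense retriever's first hit.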

The core insight still holds: separating knowledge storage (retrieved documents) from reasoning capability (the generator) makes systems more updatable, more citable, and more debuggable. The retrieval step turns 'the model knows' into 'the model was given this document' — a critical shift for production trust.

What the paper didn't anticipate

Chunking strategy matters enormously: fixed 100-word passages cut across semantic boundaries such as sentences and sections
Generator faithfulness: modern LLMs are better at instruction following, but also better at confabulating when context is ambiguous
RAG failure modes: stale, conflicting, and missing documents rarely appear on academic benchmarks but dominate production debugging
Latency at scale: serving a retriever-plus-generator pipeline under p99 latency constraints is a major production engineering problem
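To make the chunking point concrete, here is a small sketch of fixed-size word chunking with an optional overlap. With `overlap=0` it approximates the paper's fixed 100-word passages. The `overlap` parameter and the function name are assumptions for illustration, and this is only the simplest alternative, not the semantic or hierarchical chunking that production stacks use.

```python
def chunk_words(text: str, size: int = 100, overlap: int = 0) -> list[str]:
    """Split text into chunks of `size` words; consecutive chunks share `overlap` words."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


document = " ".join(f"w{i}" for i in range(250))  # synthetic 250-word document
fixed = chunk_words(document)                     # no overlap: hard cuts every 100 words
overlapped = chunk_words(document, overlap=20)    # each chunk repeats the previous 20 words
```

With no overlap, a sentence that straddles word 100 is split between two passages and neither one contains the whole fact. Overlap reduces that risk at the cost of a larger index, but it still ignores where sentences and sections actually end.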
