The RAG Paper → From Facebook AI Research to Production Retrieval Systems

Lewis et al. 2020 introduced RAG as a sequence-to-sequence model. What the original paper proposed, what the industry actually built on top of it, and how modern production RAG is fundamentally different from the original formulation.

In 2020, Patrick Lewis and colleagues at Facebook AI Research published 'Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.' The paper introduced RAG as a specific model architecture: a seq2seq generator (BART) combined with a DPR dense retriever, fine-tuned end-to-end. The retriever's query encoder and the generator are optimized jointly, while the document encoder and its index stay fixed.

That is not what people mean when they say 'RAG' today. Modern production RAG looks almost nothing like the 2020 paper. The core insight survived; the implementation did not.

What the original RAG paper proposed

The 2020 paper starts from a pre-trained dense passage retriever (DPR) and a pre-trained seq2seq generator (BART-large) and fine-tunes them together. Given a question, DPR retrieves the top-k passages from a Wikipedia index. BART then conditions on the question plus each retrieved passage to generate the answer. The system is fine-tuned end-to-end on knowledge-intensive tasks, including open-domain QA benchmarks such as Natural Questions and TriviaQA, with the retrieved passage treated as a latent variable.

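The paper introduced two variants. RAG-Sequence uses the same retrieved passage to generate the whole output and marginalizes over the top-k passages at the sequence level. RAG-Token marginalizes at every output token, so different tokens can draw on different passages. In both, retrieval runs once per input; the variants differ in how the generator combines the top-k passages, not in how often it retrieves. Both are elegant ideas. Neither is what most production systems run.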
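In the paper's formulation, the two variants differ only in where the sum over retrieved passages sits relative to the product over output tokens. The sketch below makes that concrete with made-up numbers; `doc_priors` and `token_probs` are illustrative names, not identifiers from the paper or any library.

```python
from math import prod

def rag_sequence_prob(doc_priors, token_probs):
    """Sum over passages of p(z|x) times the product over tokens of p(y_i | x, z, y_<i)."""
    return sum(
        p_z * prod(per_token)
        for p_z, per_token in zip(doc_priors, token_probs)
    )

def rag_token_prob(doc_priors, token_probs):
    """Product over tokens of the sum over passages of p(z|x) * p(y_i | x, z, y_<i)."""
    num_tokens = len(token_probs[0])
    return prod(
        sum(p_z * per_token[i] for p_z, per_token in zip(doc_priors, token_probs))
        for i in range(num_tokens)
    )

# Toy numbers, invented for illustration: two retrieved passages, two target tokens.
doc_priors = [0.5, 0.5]        # p(z | x) for each passage
token_probs = [
    [0.9, 0.1],                # passage 0 supports token 1 but not token 2
    [0.1, 0.9],                # passage 1 supports token 2 but not token 1
]

seq = rag_sequence_prob(doc_priors, token_probs)  # each passage must explain the whole output
tok = rag_token_prob(doc_priors, token_probs)     # each token can lean on a different passage
print(f"RAG-Sequence: {seq:.2f}, RAG-Token: {tok:.2f}")
```

With these numbers the token-level variant assigns the output a higher probability, because no single passage supports both tokens.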

What production RAG actually looks like

Production RAG decouples the components that the paper joined together. The retriever is not trained jointly with the generator. Embeddings come from a separate embedding model. The generator is an LLM that is prompted, or at most fine-tuned separately, and it sees retrieved text only through its prompt. There is no joint backpropagation between them.

Retrieval: off-the-shelf embedding model (OpenAI, Cohere, or BGE), separate from the generator.
Storage: vector database (Pinecone, Weaviate, Qdrant, pgvector) — no Wikipedia-specific index.
Generator: a prompted LLM (GPT-4, Claude, Llama) — not fine-tuned for RAG specifically.
Pipeline: orchestration layer (LangChain, LlamaIndex, custom) that manages chunking, embedding, retrieval, reranking, prompt assembly.
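A minimal sketch of that decoupled shape, assuming nothing about any particular vendor: `embed`, `InMemoryIndex`, and `build_prompt` are hypothetical stand-ins for an embedding model, a vector database, and an orchestration layer. The bag-of-words "embedding" is a toy so the example runs without dependencies.

```python
import math
import re
from collections import Counter

def embed(text):
    """Stand-in for an embedding model: a sparse bag-of-words vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class InMemoryIndex:
    """Stand-in for a vector database: stores (vector, text) pairs."""
    def __init__(self):
        self.rows = []

    def add(self, text):
        self.rows.append((embed(text), text))

    def search(self, query, k=2):
        q = embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[0]), reverse=True)
        return [text for _, text in ranked[:k]]

def build_prompt(question, passages):
    """Prompt assembly: retrieved text reaches the generator as plain prompt text."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

index = InMemoryIndex()
for doc in [
    "DPR is a dense passage retriever built on BERT encoders.",
    "BART is a sequence-to-sequence model used as the RAG generator.",
    "pgvector adds vector similarity search to PostgreSQL.",
]:
    index.add(doc)

question = "How does pgvector add vector search to PostgreSQL?"
passages = index.search(question, k=1)
prompt = build_prompt(question, passages)  # this string would be sent to whichever LLM you use
print(prompt)
```

Each piece can be swapped independently, which is the practical difference from the jointly trained model in the paper.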

The gaps the paper doesn't address

Chunking

The original paper retrieves fixed Wikipedia passages (~100 words). In production, you're ingesting arbitrary documents: PDFs, Notion pages, Slack threads, code files. How you chunk them is one of the most impactful decisions in your pipeline, and the paper, which simply splits articles into disjoint 100-word chunks, does not explore it.
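As a baseline to compare other strategies against, here is a sketch of fixed-size word chunking. The `overlap` parameter is an assumption added for illustration (a common choice so that sentences cut at a boundary appear whole in at least one chunk); setting it to 0 gives disjoint passages like the 100-word ones described above.

```python
def chunk_words(text, chunk_size=100, overlap=20):
    """Split text into windows of chunk_size words; consecutive windows share overlap words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

# Synthetic 250-word document: w0 w1 ... w249
document = " ".join(f"w{i}" for i in range(250))
chunks = chunk_words(document, chunk_size=100, overlap=20)
for chunk in chunks:
    words = chunk.split()
    print(len(words), words[0], words[-1])
```

Real documents usually call for structure-aware splitting (headings, paragraphs, code blocks); this version ignores structure entirely, which is exactly why it makes a useful baseline.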

Reranking

The paper retrieves the top-k passages by DPR score and passes them directly to the generator. Production systems often add a reranking step: a cross-encoder that reads the query and each retrieved passage together and re-scores the passage for relevance. This typically improves precision, at the cost of added latency, because the cross-encoder runs once per candidate.
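The two-stage shape can be sketched without a real model. In the example below, `first_stage_score` stands in for embedding similarity and `pair_score` is a hypothetical stand-in for a cross-encoder: it is not one, it only mimics the property that matters here, which is scoring the query and passage jointly rather than independently.

```python
import re

def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def first_stage_score(query, passage):
    """Cheap stand-in for embedding similarity: number of shared distinct words."""
    return len(set(tokens(query)) & set(tokens(passage)))

def pair_score(query, passage):
    """Hypothetical stand-in for a cross-encoder: looks at the pair jointly.
    It counts query word pairs that appear adjacent and in order in the passage."""
    q, p = tokens(query), tokens(passage)
    passage_bigrams = set(zip(p, p[1:]))
    return sum(1 for pair in zip(q, q[1:]) if pair in passage_bigrams)

def retrieve_then_rerank(query, corpus, k_retrieve=3, k_final=1, scorer=pair_score):
    # Stage 1: wide, cheap recall over the whole corpus.
    candidates = sorted(
        corpus, key=lambda p: first_stage_score(query, p), reverse=True
    )[:k_retrieve]
    # Stage 2: the expensive scorer runs only on the k_retrieve candidates.
    return sorted(candidates, key=lambda p: scorer(query, p), reverse=True)[:k_final]

corpus = [
    "Rate limits apply to the API. The default limit can change.",  # right words, wrong meaning
    "The default API rate limit is 100 requests per minute.",
    "Billing is calculated per seat.",
]
query = "default API rate limit"
stage_one_top = sorted(corpus, key=lambda p: first_stage_score(query, p), reverse=True)[0]
reranked_top = retrieve_then_rerank(query, corpus)[0]
print("stage 1 only:", stage_one_top)
print("after rerank:", reranked_top)
```

The latency cost is visible in the structure: the second scorer is called `k_retrieve` times per query, so `k_retrieve` is the knob that trades recall against response time.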

Query transformation

The paper passes raw queries to the retriever. Production systems often rewrite queries before retrieval — HyDE (hypothetical document embeddings), step-back prompting, multi-query expansion. A user's actual question is often not the best retrieval query.
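Multi-query expansion is the easiest of these to sketch. In a real system an LLM would generate the query variants; here `expand_query` is a hypothetical stand-in with hard-coded rewrites, and `search` is a toy word-overlap retriever, so the example only illustrates the control flow.

```python
import re

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def search(query, corpus, k=2):
    """Toy retriever: rank passages by word overlap, dropping passages with none."""
    scored = [(len(tokens(query) & tokens(p)), p) for p in corpus]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
    return [p for _, p in ranked[:k]]

def expand_query(query):
    """Hypothetical stand-in for an LLM rewriting step; the variants are hard-coded."""
    canned = {
        "why is my app slow?": [
            "application latency causes",
            "performance profiling slow requests",
        ],
    }
    return [query] + canned.get(query.lower(), [])

def multi_query_search(query, corpus, k=2):
    # Retrieve once per variant and merge, keeping first-seen order and dropping duplicates.
    merged = []
    for variant in expand_query(query):
        for passage in search(variant, corpus, k):
            if passage not in merged:
                merged.append(passage)
    return merged

corpus = [
    "Common causes of high application latency include N+1 queries.",
    "Use performance profiling to find slow requests.",
    "Our office is closed on public holidays.",
]
raw_hits = search("Why is my app slow?", corpus)
expanded_hits = multi_query_search("Why is my app slow?", corpus)
print("raw:", raw_hits)
print("expanded:", expanded_hits)
```

The raw question misses the latency passage because it shares no vocabulary with it, while one of the rewrites finds it. The merged list is also longer and noisier, which is one reason expansion is usually paired with a reranking step.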

Hallucination and groundedness

The paper evaluates answer accuracy on QA benchmarks. Production RAG needs to verify that generated answers are grounded in retrieved context — not just accurate on average. Citation tracking, faithfulness scoring, and groundedness metrics are production concerns absent from the paper.
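To make the idea of a groundedness check concrete, here is a deliberately crude lexical proxy: it flags answer sentences whose content words mostly do not appear in the retrieved context. The function name, stop-word list, and `threshold` value are assumptions for this sketch; production systems more often use an entailment model or an LLM judge for this step.

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "are", "was", "of", "in", "to", "and", "it"}

def content_words(text):
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS}

def groundedness(answer, passages, threshold=0.6):
    """Return (score, unsupported_sentences).
    score is the share of answer sentences whose content words mostly occur in the context."""
    context = content_words(" ".join(passages))
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    unsupported = []
    for sentence in sentences:
        words = content_words(sentence)
        support = len(words & context) / len(words) if words else 1.0
        if support < threshold:
            unsupported.append(sentence)
    score = 1 - len(unsupported) / len(sentences) if sentences else 0.0
    return score, unsupported

passages = ["RAG was published in 2020 by Lewis and colleagues at Facebook AI Research."]
# The second sentence is deliberately unsupported by the passage above.
answer = "RAG was published in 2020. It was trained on a private corpus of legal contracts."
score, unsupported = groundedness(answer, passages)
print(score, unsupported)
```

Word overlap cannot catch a sentence that reuses the context's vocabulary to say something false, so a check like this is a floor, not a verdict. Its value is that it reports per-answer failures instead of an average over a benchmark.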

The paper's core insight — combining parametric knowledge (in model weights) with non-parametric knowledge (in a retrieval index) — is correct and powerful. Everything else is implementation detail that production systems have replaced.
