GenAI Systems Lab
AI Engineering 13 min read

Build a Minimal RAG in 50 Lines (No Framework)

Chunk, embed, retrieve, prompt, generate — the full RAG loop without LangChain or LlamaIndex. Free to run on Colab using sentence-transformers. Once you have built it manually, every framework abstraction becomes legible and every failure mode becomes findable.

LangChain exists to abstract away the pieces of a RAG pipeline so you can assemble them quickly. The cost of that abstraction is that you stop understanding what is actually happening. When something breaks — the wrong chunks are retrieved, the answer is hallucinated, latency is unexpectedly high — you are debugging configuration, not code. You do not know what to look at.

The fix is to build RAG manually first. 50 lines. No framework. Free to run on Colab. Once you have done it, every LangChain abstraction becomes legible — you know what it replaced and why. This post is that 50-line build.

What the pipeline actually does

RAG is five steps: chunk the source documents into retrievable pieces; embed each chunk into a vector and store the vectors; retrieve by embedding the user query with the same model and finding the most similar chunks by cosine similarity; inject those chunks into a prompt; and call an LLM to generate the answer. That is the entire pattern. Every production RAG system is a variation on this loop with better chunking, better retrieval, reranking, and evaluation on top.
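The "most similar chunks by cosine similarity" step is worth seeing in isolation before the full build. The sketch below uses toy 3-dimensional vectors (illustrative values, not real embeddings) to show that once vectors are scaled to unit length, cosine similarity is just a dot product. That is the assumption the retrieval code relies on when it normalizes embeddings and then multiplies matrices.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity from the textbook definition."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def normalize(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)


# Toy vectors standing in for a chunk embedding and a query embedding.
chunk_vec = np.array([1.0, 2.0, 2.0])
query_vec = np.array([2.0, 1.0, 2.0])

# Full formula versus the shortcut on pre-normalized vectors.
full = cosine_similarity(chunk_vec, query_vec)
shortcut = float(np.dot(normalize(chunk_vec), normalize(query_vec)))

print(f"full formula: {full:.6f}")
print(f"dot product of unit vectors: {shortcut:.6f}")
```

Normalizing once at indexing time means every query pays only for a dot product, which is why the normalization happens up front rather than inside the similarity computation.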

The implementation

# Run this cell first
!pip install sentence-transformers -q
import numpy as np
from sentence_transformers import SentenceTransformer

# ── 1. Your corpus ────────────────────────────────────────────────────────────
# In production: load PDFs, split by paragraph, clean whitespace.
# Here: plain strings simulating pre-chunked documents.
corpus = [
    "Transformers replaced RNNs because self-attention processes all tokens in parallel, enabling training on much larger datasets.",
    "The context window is the maximum number of tokens a model can attend to at once. GPT-4 supports up to 128k tokens.",
    "Retrieval-augmented generation reduces hallucination by grounding the model in retrieved documents before generating.",
    "BM25 is a sparse retrieval method that uses term frequency and inverse document frequency without neural embeddings.",
    "Dense retrieval encodes queries and documents as vectors; retrieval is a nearest-neighbour search in embedding space.",
    "Chunking strategy affects retrieval quality significantly. Fixed-size chunks are simple; semantic chunks preserve meaning better.",
    "Reranking uses a cross-encoder to re-score the top-k retrieved chunks — more accurate than bi-encoder retrieval but slower.",
    "Hallucination in RAG often occurs when the retrieved context does not contain the answer but the model generates one anyway.",
    "Fine-tuning adapts a pre-trained model using labelled examples. It changes the model weights, unlike RAG which leaves them unchanged.",
    "The embedding model used for retrieval must match between indexing time and query time or retrieval quality collapses.",
]

# ── 2. Embed corpus (downloads ~80MB model on first run, cached after) ────────
model = SentenceTransformer("all-MiniLM-L6-v2")
corpus_embeddings = model.encode(corpus, normalize_embeddings=True)  # (10, 384)

# ── 3. Retrieve: embed query, cosine similarity, return top-k ─────────────────
def retrieve(query, k=3):
    q_emb = model.encode([query], normalize_embeddings=True)   # (1, 384)
    scores = corpus_embeddings @ q_emb.T                        # (10, 1)
    top_k_idx = np.argsort(scores[:, 0])[::-1][:k]
    return [(corpus[i], float(scores[i, 0])) for i in top_k_idx]

# ── 4. Build the prompt ───────────────────────────────────────────────────────
def build_prompt(query, chunks):
    context = "\n\n".join(
        f"[Chunk {i+1}] {chunk}" for i, (chunk, _) in enumerate(chunks)
    )
    return f"""You are a helpful assistant. Answer using only the context below.
If the answer is not in the context, say "I don't know based on the provided context."

Context:
{context}

Question: {query}
Answer:"""

# ── 5. Run it ─────────────────────────────────────────────────────────────────
query = "Why does RAG reduce hallucination?"
chunks = retrieve(query, k=3)

print("Retrieved chunks:")
for chunk, score in chunks:
    print(f"  [similarity={score:.3f}] {chunk[:80]}...")

prompt = build_prompt(query, chunks)
print("\n" + "="*60)
print("Prompt ready to send to any LLM:")
print("="*60)
print(prompt)

# ── 6. Send to an LLM (add your own key) ────────────────────────────────────
# Option A — OpenAI:
# from openai import OpenAI
# client = OpenAI(api_key="sk-...")
# response = client.chat.completions.create(
#     model="gpt-4o-mini",
#     messages=[{"role": "user", "content": prompt}]
# )
# print(response.choices[0].message.content)
#
# Option B — Free via HuggingFace Inference API:
# import requests
# API_URL = "https://api-inference.huggingface.co/models/mistralai/Mistral-7B-Instruct-v0.2"
# headers = {"Authorization": "Bearer hf_..."}
# response = requests.post(API_URL, headers=headers, json={"inputs": prompt})
# print(response.json()[0]["generated_text"])

What to observe when this runs

The retrieval step takes about 50ms for 10 documents, and typically nearly all of that is encoding the query; the dot product over 10 vectors is negligible. For 100k documents you would use FAISS instead of a numpy dot product, but the cosine computation is identical. Note which chunks are returned and whether they actually answer the query. When they do not, because the retriever surfaced chunks that are topically similar but do not contain the specific answer, you are looking at the 'missing context' failure mode. The prompt explicitly instructs the model to say 'I don't know' when context is insufficient, but LLMs frequently ignore this instruction. When the model answers anyway, that is the hallucination failure mode.
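One cheap guard against answering from weak context is to check similarity scores before calling the LLM at all. The sketch below is an assumption layered on top of the post's pipeline, not part of it: `filter_by_score`, `retrieve_or_abstain`, and the 0.3 threshold are hypothetical, and the scores in the demo are made-up illustrative values. It only catches the case where nothing scores well; chunks that are topically similar but lack the answer will still pass.

```python
def filter_by_score(chunks, min_score=0.3):
    """Keep only (text, score) pairs whose similarity clears the threshold."""
    return [(text, score) for text, score in chunks if score >= min_score]


def retrieve_or_abstain(chunks, min_score=0.3):
    """Return the chunks worth prompting with, or None to signal 'abstain'."""
    kept = filter_by_score(chunks, min_score)
    return kept if kept else None


# Illustrative (text, score) pairs in the shape returned by a retriever.
# The scores are invented for the example, not measured.
retrieved = [
    ("Retrieval-augmented generation reduces hallucination ...", 0.71),
    ("Hallucination in RAG often occurs when ...", 0.55),
    ("BM25 is a sparse retrieval method ...", 0.12),
]

kept = retrieve_or_abstain(retrieved)
if kept is None:
    print("I don't know based on the provided context.")
else:
    print(f"Prompting with {len(kept)} of {len(retrieved)} chunks")
```

The right threshold depends on the embedding model and the corpus, so treat 0.3 as a placeholder to tune against real queries.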

Run the same query with k=1, k=3, k=5. Watch what happens to the prompt length. At k=5 you are using significantly more of the context window. At some point you are retrieving chunks that are not relevant, adding noise to the prompt and increasing the chance the model anchors on the wrong information. This is the retrieval precision vs. recall tradeoff made visible.
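The k sweep is easier to compare side by side than to eyeball across three runs. The helper below is a sketch: `sweep_k` is a hypothetical name, and it assumes a retriever that accepts `(query, k=...)` and returns `(text, score)` pairs plus a prompt builder that accepts `(query, chunks)`, which is the shape of the functions in the implementation above. The `fake_*` functions and their scores are stand-ins so the block runs on its own.

```python
def sweep_k(retrieve_fn, build_prompt_fn, query, ks=(1, 3, 5)):
    """Run the same query at several k values and summarise each prompt."""
    rows = []
    for k in ks:
        chunks = retrieve_fn(query, k=k)  # list of (text, score)
        prompt = build_prompt_fn(query, chunks)
        rows.append({
            "k": k,
            "prompt_chars": len(prompt),
            # The weakest chunk admitted at this k: a rough noise indicator.
            "lowest_score": min(score for _, score in chunks),
        })
    return rows


# Hypothetical stand-ins with invented scores, already sorted best-first.
_FAKE_RANKED = [
    ("chunk that answers the question directly", 0.80),
    ("chunk on a closely related topic", 0.62),
    ("chunk that shares some vocabulary", 0.41),
    ("chunk that is mostly unrelated", 0.18),
    ("chunk that is unrelated", 0.05),
]


def fake_retrieve(query, k=3):
    return _FAKE_RANKED[:k]


def fake_build_prompt(query, chunks):
    context = "\n\n".join(text for text, _ in chunks)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


rows = sweep_k(fake_retrieve, fake_build_prompt, "Why does RAG reduce hallucination?")
for row in rows:
    print(row)
```

In the notebook, passing the real `retrieve` and `build_prompt` in place of the stand-ins gives the same table for the actual corpus. A falling `lowest_score` alongside a growing `prompt_chars` is the precision cost of higher recall.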

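Try breaking it intentionally: encode the query with a different embedding model than the one used to index the corpus. If the two models have different output dimensions, the dot product fails with a shape error, which is the lucky case. If the dimensions match, the code runs but the similarity scores are meaningless and you get garbage retrieval. This is the 'embedding model mismatch' failure mode that silently breaks RAG systems when models are updated without re-indexing.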
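Because a mismatch can fail silently, one defensive pattern is to store the model name and vector dimension next to the index and check both before every query. The sketch below is an assumption, not something the 50-line build includes: `IndexMeta`, `EmbeddingMismatchError`, and `check_compatible` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexMeta:
    """What was used to build the index; saved alongside the vectors."""
    model_name: str
    dim: int


class EmbeddingMismatchError(ValueError):
    """Raised when the query encoder does not match the index."""


def check_compatible(index_meta, query_model_name, query_dim):
    """Fail loudly instead of returning meaningless similarity scores."""
    if index_meta.dim != query_dim:
        raise EmbeddingMismatchError(
            f"index has dim {index_meta.dim}, query encoder has dim {query_dim}"
        )
    if index_meta.model_name != query_model_name:
        raise EmbeddingMismatchError(
            f"index built with {index_meta.model_name!r}, "
            f"query encoded with {query_model_name!r}; re-index first"
        )


# Recorded at indexing time.
meta = IndexMeta(model_name="all-MiniLM-L6-v2", dim=384)

# Same model and dimension: passes silently.
check_compatible(meta, "all-MiniLM-L6-v2", 384)

# Same dimension, different model (placeholder name): the silent case, now loud.
try:
    check_compatible(meta, "some-other-384-dim-model", 384)
except EmbeddingMismatchError as err:
    print(f"blocked: {err}")
```

In the notebook, the dimension to record would be `corpus_embeddings.shape[1]`, and the name would be the string passed to `SentenceTransformer`.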

What to add next

This pipeline has no chunking (the corpus is pre-split), no reranking, no evaluation, and no streaming. Each addition teaches something. Add chunking: split a real PDF by paragraph and observe how chunk boundaries affect what gets retrieved. Add a reranker: take the top-10 retrieved chunks and re-score them with a cross-encoder, then watch precision improve at the cost of latency. Add FAISS: replace the numpy dot product with a FAISS IndexFlatIP, which is exact search and returns the same results; the large speedups at scale come from FAISS's approximate indexes such as IVF or HNSW. Add an eval harness: create 10 question-answer pairs from your corpus and measure Recall@3 automatically after any config change.
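Two of these additions, paragraph chunking and the Recall@k harness, fit in a few lines each. The sketch below makes several assumptions: `split_paragraphs` and `recall_at_k` are hypothetical names, the text has already been extracted from the PDF, paragraphs are separated by blank lines, and each eval question has exactly one expected chunk (so Recall@k reduces to a hit rate). `keyword_retrieve` is a toy stand-in for the real retriever so the block runs without a model download.

```python
import re


def split_paragraphs(text, min_chars=1):
    """Split on blank lines, collapse inner whitespace, drop tiny pieces."""
    parts = re.split(r"\n\s*\n", text)
    chunks = [" ".join(part.split()) for part in parts]
    return [chunk for chunk in chunks if len(chunk) >= min_chars]


def recall_at_k(eval_set, retrieve_fn, k=3):
    """Fraction of questions whose expected chunk appears in the top k."""
    hits = 0
    for question, expected_chunk in eval_set:
        retrieved_texts = [text for text, _ in retrieve_fn(question, k=k)]
        if expected_chunk in retrieved_texts:
            hits += 1
    return hits / len(eval_set)


# Chunking demo on already-extracted text.
raw_text = "First paragraph\nstill first.\n\n\nSecond   paragraph.\n\n"
paragraphs = split_paragraphs(raw_text)
print(paragraphs)

# Eval demo with a toy corpus and a keyword-overlap stand-in retriever.
docs = [
    "cats purr when happy",
    "dogs bark at strangers",
    "fish swim in water",
    "birds sing at dawn",
]


def keyword_retrieve(query, k=3):
    """Hypothetical retriever: ranks docs by number of shared words."""
    q_words = set(query.lower().split())
    scored = [(doc, float(len(q_words & set(doc.split())))) for doc in docs]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


eval_set = [
    ("why do cats purr", docs[0]),
    ("when do birds sing", docs[3]),
    ("what do owls eat", "owls eat mice"),  # expected chunk is not in the corpus
]
recall = recall_at_k(eval_set, keyword_retrieve, k=1)
print(f"Recall@1 = {recall:.2f}")
```

Swapping `keyword_retrieve` for the notebook's `retrieve` and setting `k=3` gives the Recall@3 number to re-check after each config change.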

LangChain, LlamaIndex, and every other RAG framework wrap exactly these steps. Having built them manually, you now read their documentation differently: you are reading about configuration choices for operations you understand, not magic.

RAG Lab: configure the failure modes. The RAG Lab puts you inside a production pipeline with real failure modes. Run this 50-line version first, then the lab shows you what each failure looks like at scale.
