Chunking Strategies for RAG: Fixed, Semantic, and Hierarchical
Why chunk size is one of the most impactful RAG config decisions. Fixed-size vs. semantic vs. hierarchical chunking, and how each affects retrieval quality.
Chunking is how you turn a large document into retrievable pieces. It sounds like a preprocessing detail. It is actually one of the most impactful configuration decisions in any RAG system.
Chunk too small and you lose context — the retrieved passage doesn't contain enough surrounding information for the model to answer. Chunk too large and you dilute relevance — the retrieved passage contains the answer buried in noise.
Fixed-size chunking
Split every document into chunks of N tokens with an overlap of M tokens. Fast, predictable, no dependencies. This is the default in most RAG tutorials and it's good enough to get started.
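from langchain.text_splitter import RecursiveCharacterTextSplitter

# length_function=len would count characters, not tokens, so measure
# chunk length with a tiktoken encoding instead (requires `tiktoken`).
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=512,    # tokens per chunk
    chunk_overlap=64,  # tokens shared between adjacent chunks
)
chunks = splitter.split_text(document)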
Fixed chunking can split mid-sentence, mid-table, and mid-code-block. If your documents have structure, this can destroy it. A table split across three chunks is unlikely to be retrieved intact, because no single chunk holds both the header row and the values that answer the query.
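To make the windowing concrete, here is a minimal, dependency-free sketch of fixed-size chunking with overlap. The function name `fixed_chunks` is hypothetical, and whitespace-separated words stand in for real tokenizer tokens, so treat it as an illustration rather than a replacement for a library splitter.

```python
def fixed_chunks(tokens, chunk_size, overlap):
    """Slide a window of chunk_size tokens, stepping by chunk_size - overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks


# Whitespace "tokens" stand in for a real tokenizer here (assumption).
text = "Revenue grew 12% in Q3. Costs fell 3%. Margin improved as a result."
tokens = text.split()
for chunk in fixed_chunks(tokens, chunk_size=6, overlap=2):
    print(" ".join(chunk))
```

The first chunk ends partway through the second sentence, which is the mid-sentence split described above. The overlap repeats the boundary tokens in the next chunk, but it does not stop the split from happening.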
Semantic chunking
Instead of counting tokens, detect natural topic boundaries. Embed consecutive sentences and measure cosine similarity. When similarity drops sharply, you've hit a topic boundary — split there. This produces semantically coherent chunks at the cost of more computation at index time.
- Produces chunks with higher internal coherence — better retrieval precision
- Chunk sizes vary (some very short, some very long) — harder to predict latency
- Requires an embedding model at indexing time — more infrastructure
- Best for long-form documents with clear section structure
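
The boundary detection described above can be sketched as follows. This is a toy version under stated assumptions: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and the fixed `threshold=0.2` is an arbitrary illustrative value. Real pipelines often pick the cutoff from the distribution of similarities instead.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Stand-in embedding: a bag-of-words count vector (assumption, not a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences, embed=toy_embed, threshold=0.2):
    """Start a new chunk wherever adjacent-sentence similarity falls below threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append([sentence])  # similarity dropped: treat as a topic boundary
        else:
            chunks[-1].append(sentence)
    return [" ".join(chunk) for chunk in chunks]


sentences = [
    "The cache stores recent query results.",
    "The cache evicts old query results after an hour.",
    "Billing runs on the first day of each month.",
    "Billing invoices are emailed on the same day.",
]
for chunk in semantic_chunks(sentences):
    print(chunk)
```

Swapping `toy_embed` for a real embedding function leaves the splitting logic unchanged, though `cosine` would need to accept dense vectors. That call is also where the extra indexing cost comes from.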
Hierarchical (parent-child) chunking
Store two chunk sizes: small child chunks for retrieval, large parent chunks for context. At query time, retrieve the small chunk (high precision), then fetch its parent and send the full parent to the LLM (full context). You get the retrieval precision of small chunks and the context of large ones, at the cost of storing a child-to-parent mapping and sending more tokens to the model.
Parent-child chunking often outperforms fixed chunking in retrieval benchmarks, though the size of the gain depends on the corpus and the queries. The retriever sees small, precise chunks. The generator sees full, contextual passages. The split in responsibility is the key insight.
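A minimal sketch of the retrieve-child, return-parent flow, assuming an in-memory index. The helpers `build_index` and `retrieve_parent` are hypothetical names, and word overlap stands in for vector similarity search so the example runs without an embedding model.

```python
import re


def words(text):
    """Lowercased word set, used for a crude overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(parents, child_size=8):
    """Split each parent into small word-window children that remember their parent id."""
    index = []
    for parent_id, parent in enumerate(parents):
        tokens = parent.split()
        for start in range(0, len(tokens), child_size):
            child = " ".join(tokens[start:start + child_size])
            index.append((parent_id, child))
    return index


def retrieve_parent(query, parents, index):
    """Score children by word overlap (stand-in for vector search); return best child and its parent."""
    query_words = words(query)
    parent_id, child = max(index, key=lambda item: len(query_words & words(item[1])))
    return child, parents[parent_id]


parents = [
    "Refunds are issued within 14 days of a return. The customer pays "
    "return shipping unless the item arrived damaged.",
    "Standard delivery takes 3 to 5 business days. Express delivery takes "
    "1 business day and costs extra.",
]
index = build_index(parents)
child, parent = retrieve_parent("how long does express delivery take", parents, index)
print("matched child:", child)
print("sent to LLM:  ", parent)
```

Only the children are scored against the query. The parent is looked up by id afterwards, so the generator receives the whole delivery passage even though a single short window matched.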
Choosing your chunk size
| Document type | Recommended chunk size | Overlap (tokens) | Strategy |
|---|---|---|---|
| Q&A / FAQ | 128–256 tokens | 16 | Fixed — each Q&A is self-contained |
| Technical docs | 512 tokens | 64 | Fixed or parent-child |
| Legal / contracts | 256–512 tokens | 64 | Semantic — preserve clauses |
| Code | Function-level | 0 | Split on function/class boundaries |
| Earnings reports | Section-level parents | N/A | Parent-child, with section headers as parents |
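
For the code row, splitting on function and class boundaries can be sketched with Python's standard `ast` module. This assumes the source being indexed is Python; other languages would need their own parser. The helper name `split_python_source` is hypothetical, and this version keeps only top-level functions and classes, dropping module-level statements such as imports.

```python
import ast


def split_python_source(source):
    """Return one (name, chunk) pair per top-level function or class in a Python module."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno are 1-based and inclusive (end_lineno needs Python 3.8+).
            # Include decorator lines so the chunk starts at the first decorator.
            start = min([node.lineno] + [d.lineno for d in node.decorator_list])
            chunks.append((node.name, "\n".join(lines[start - 1:node.end_lineno])))
    return chunks


source = '''
import math

def area(radius):
    return math.pi * radius ** 2

class Circle:
    def __init__(self, radius):
        self.radius = radius

    def area(self):
        return area(self.radius)
'''

for name, chunk in split_python_source(source):
    print(f"--- {name} ---")
    print(chunk)
```

Each chunk is a complete definition, so no overlap is needed. A very large class may still need a second pass that splits on its methods.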