
Chunking Strategies for RAG: Fixed, Semantic, and Hierarchical

Why chunk size is one of the most impactful RAG config decisions. Fixed-size vs. semantic vs. hierarchical chunking, and how each one affects retrieval quality.

Chunking is how you turn a large document into retrievable pieces. It sounds like a preprocessing detail. It is actually one of the most impactful configuration decisions in any RAG system.

Chunk too small and you lose context — the retrieved passage doesn't contain enough surrounding information for the model to answer. Chunk too large and you dilute relevance — the retrieved passage contains the answer buried in noise.

Fixed-size chunking

Split every document into chunks of N tokens with an overlap of M tokens. Fast, predictable, no dependencies. This is the default in most RAG tutorials and it's good enough to get started.

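# Newer LangChain releases expose this class from langchain_text_splitters.
from langchain.text_splitter import RecursiveCharacterTextSplitter

# from_tiktoken_encoder measures length in tokens (requires the tiktoken package).
# The plain constructor with length_function=len would count characters instead.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # assumption: pick the encoding that matches your model
    chunk_size=512,      # tokens per chunk
    chunk_overlap=64,    # tokens shared between consecutive chunks
)
chunks = splitter.split_text(document)  # document: the raw text as a str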

Fixed chunking can split mid-sentence, mid-table, and mid-code-block. If your documents have structure, this destroys it. A table split across three chunks rarely retrieves correctly, because no single chunk holds both the header row and the values it labels.
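To make the sliding window concrete, here is a minimal dependency-free sketch. The function name fixed_size_chunks is hypothetical, and whitespace-separated words stand in for real tokenizer output. It is meant to show the mechanics, not to replace a library splitter.

```python
def fixed_size_chunks(tokens, chunk_size, overlap):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Whitespace "tokens" are an assumption standing in for a real tokenizer.
text = "Refunds are issued within 30 days. Contact support to start a claim."
tokens = text.split()
for chunk in fixed_size_chunks(tokens, chunk_size=5, overlap=1):
    print(" ".join(chunk))
```

With these toy settings the first window ends at "30" and the next one starts there, so the phrase "within 30 days" is split across two chunks. The overlap softens boundary cuts like this but does not remove them.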

Semantic chunking

Instead of counting tokens, detect natural topic boundaries. Embed consecutive sentences and measure cosine similarity. When similarity drops sharply, you've hit a topic boundary — split there. This produces semantically coherent chunks at the cost of more computation at index time.

Produces chunks with higher internal coherence — better retrieval precision
Chunk sizes vary (some very short, some very long) — harder to predict latency
Requires an embedding model at indexing time — more infrastructure
Best for long-form documents with clear section structure
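The boundary detection described above can be sketched as follows. This is an illustration under stated assumptions: toy_embed is a hypothetical bag-of-words stand-in for a real embedding model, and the fixed threshold of 0.2 is an arbitrary value chosen for this toy data. In practice you would pass your own embedding function through the embed parameter and tune the cutoff.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences, embed=toy_embed, threshold=0.2):
    """Start a new chunk wherever adjacent-sentence similarity falls below threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append([sentence])    # similarity dropped: treat as a topic boundary
        else:
            chunks[-1].append(sentence)  # same topic: extend the current chunk
    return [" ".join(chunk) for chunk in chunks]


sentences = [
    "The refund policy covers hardware purchases.",
    "A refund for hardware is issued within 30 days.",
    "Our API rate limit is 100 requests per minute.",
    "Requests over the rate limit receive a 429 response.",
]
for chunk in semantic_chunks(sentences):
    print(chunk)
```

A fixed threshold is the simplest possible rule. Choosing the cutoff relative to the distribution of similarities in each document is a common alternative, since absolute similarity values differ between embedding models.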

Hierarchical (parent-child) chunking

Store two chunk sizes: small child chunks for retrieval, large parent chunks for context. At query time, retrieve the small chunk (high precision), then fetch its parent and send the full parent to the LLM (full context). This combines the retrieval precision of small chunks with the context of large ones.

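Parent-child chunking often outperforms fixed chunking when answers depend on surrounding context, though the size of the gain depends on the corpus and is worth measuring on your own data. The retriever sees small, precise chunks. The generator sees full, contextual passages. The split in responsibility is the key insight.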
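The retrieve-small, return-large flow can be sketched in a few lines. Everything here is illustrative: build_parent_child_index, retrieve_parent, and toy_score are hypothetical names, and the word-overlap score is a stand-in for the vector similarity a real retriever would use.

```python
import re


def build_parent_child_index(sections, child_size=8):
    """Map each small child chunk to the id of the parent section it came from."""
    parents = {}   # parent_id -> full section text
    children = []  # (child_text, parent_id); only these would be embedded
    for parent_id, section in enumerate(sections):
        parents[parent_id] = section
        words = section.split()
        for start in range(0, len(words), child_size):
            child_text = " ".join(words[start:start + child_size])
            children.append((child_text, parent_id))
    return parents, children


def _words(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def toy_score(query, text):
    """Hypothetical stand-in for vector similarity: number of shared words."""
    return len(_words(query) & _words(text))


def retrieve_parent(query, parents, children):
    """Match the query against child chunks, then return the whole parent."""
    best_child, parent_id = max(children, key=lambda c: toy_score(query, c[0]))
    return best_child, parents[parent_id]


sections = [
    "Revenue grew 12 percent year over year, driven by subscription renewals. "
    "Operating margin held steady at 21 percent.",
    "Headcount rose to 4,200 employees. "
    "Most hiring was in the support organization during the second half.",
]
parents, children = build_parent_child_index(sections)
child, parent = retrieve_parent("How many employees were hired?", parents, children)
print("matched child:", child)
print("sent to LLM: ", parent)
```

The child that matches mentions only the headcount, while the parent passed to the generator also carries the sentence about where the hiring happened.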

Choosing your chunk size

Document type	Recommended chunk size	Overlap (tokens)	Strategy
Q&A / FAQ	128–256 tokens	16	Fixed — each Q&A is self-contained
Technical docs	512 tokens	64	Fixed or parent-child
Legal / contracts	256–512 tokens	64	Semantic — preserve clauses
Code	Function-level	0	Split on function/class boundaries
Earnings reports	Section-level parents	N/A	Parent-child, with section headers as parents
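For the Code row, splitting on function and class boundaries can be sketched with the standard library ast module when the source is Python. The function name split_python_source is hypothetical, and this sketch only handles top-level definitions in a single module. Other languages would need their own parser.

```python
import ast


def split_python_source(source):
    """Return one chunk per top-level function or class in a Python module."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment recovers the exact source text of the node
            chunks.append(ast.get_source_segment(source, node))
    return chunks


source = '''
import math

def area(r):
    return math.pi * r ** 2

class Circle:
    def __init__(self, r):
        self.r = r
'''

for chunk in split_python_source(source):
    print(chunk)
    print("---")
```

Module-level statements such as the import are skipped in this sketch, so a real indexer would need to decide whether to attach them to each chunk or store them separately.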

