
Why Chunk Boundaries Kill RAG Quality

When a paragraph is split mid-sentence, neither fragment is semantically complete. Both get mediocre similarity scores. The retrieved chunk is a fragment — and the model answers from a fragment. The failure is invisible in retrieval metrics.

A legal team queries their RAG system about the compensation structure in an employment contract. The retrieved chunk has a similarity score of 0.81, which reads as high confidence. The chunk text reads: "...which shall be calculated as follows: base salary multiplied by the performance coefficient defined in Schedule B, subject to a maximum of three times the base rate." The model answers that compensation is calculated by multiplying salary by a performance coefficient. The user's actual question was what the performance coefficient is. That definition sat on the other side of the chunk boundary, in the previous chunk, which was not retrieved.

This is the chunk boundary failure. It is common, it is consequential, and it is nearly invisible in standard retrieval metrics, because the retrieved chunk scored well — the retrieval system did its job. The failure happened upstream, at indexing time, before any query was ever run.

Fixed-size chunking splits documents at character or token count thresholds without regard for semantic structure. A paragraph that defines a term and then uses it gets cut at an arbitrary boundary. The first chunk contains the definition but not the application. The second chunk contains the application but not the definition. Each chunk is semantically incomplete — it has the right vocabulary but is missing the context that makes the vocabulary meaningful.

When an incomplete chunk is embedded, its vector captures only part of the semantic content. A query about performance coefficient calculation generates an embedding that partially matches both chunks — but neither matches as well as a complete, coherent paragraph would. The retrieval system selects the highest-scoring fragment. The model answers from the fragment. The answer is wrong in a way that looks completely correct given what was retrieved.

Original paragraph (180 tokens, semantically complete):
  "The performance coefficient is defined as the ratio of achieved KPIs
  to target KPIs over the review period per Schedule B. Compensation
  shall be calculated as follows: base salary multiplied by this
  coefficient, subject to a maximum of 3x the base rate."

  Similarity to query "performance coefficient calculation": 0.89

After fixed-size split at 90 tokens:

  Chunk A: "The performance coefficient is defined as the ratio of
  achieved KPIs to target KPIs over the review period per..."
  → Similarity to query: 0.71  (definition, no calculation)

  Chunk B: "...Schedule B. Compensation shall be calculated as follows:
  base salary multiplied by this coefficient, subject to max 3x..."
  → Similarity to query: 0.74  ← retrieved (calculation, no definition)

Both fragments score worse than the complete paragraph.
Chunk B retrieved. Model answers without knowing what the coefficient is.
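
The split illustrated above can be reproduced with a few lines of Python. This is a minimal sketch, not a production chunker: it assumes whitespace-separated words stand in for model tokens, and the function name `fixed_size_chunks` and the threshold of 19 are hypothetical choices made only so the cut lands just before "Schedule B.", as in the illustration. The word counts therefore differ from the token counts quoted above.

```python
# Illustrative only: whitespace-separated words stand in for model tokens.
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Split text every `size` tokens, ignoring sentence boundaries."""
    if size <= 0:
        raise ValueError("size must be positive")
    tokens = text.split()
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]


paragraph = (
    "The performance coefficient is defined as the ratio of achieved KPIs "
    "to target KPIs over the review period per Schedule B. Compensation "
    "shall be calculated as follows: base salary multiplied by this "
    "coefficient, subject to a maximum of 3x the base rate."
)

# 19 is an arbitrary threshold chosen so the cut falls mid-sentence.
chunks = fixed_size_chunks(paragraph, size=19)
for number, chunk in enumerate(chunks):
    print(number, repr(chunk))
```

The first chunk ends on a dangling "per" and the second opens with "Schedule B.", so neither carries both the definition and the calculation.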

There are two standard remedies. Semantic chunking splits on detected topic or sentence boundaries rather than token counts, keeping conceptually coherent units together. Overlapping chunks include the last N tokens of the previous chunk at the start of the next; a 10–20% overlap catches most boundary splits without doubling storage. Either fix applied alone is meaningfully better than fixed-size chunking. Both applied together are substantially better.
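
Both remedies can be sketched in plain Python. The code below is a simplified illustration under stated assumptions: a naive regular-expression sentence splitter stands in for real semantic boundary detection (it would mis-split abbreviations such as "e.g."), whitespace-separated words stand in for tokens, and the names `split_sentences`, `sentence_chunks`, and `overlapping_chunks` are hypothetical. In `sentence_chunks` the carried-over sentences are not counted against `max_tokens`, so a chunk can exceed the budget.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentence_chunks(text: str, max_tokens: int, overlap_sentences: int = 1) -> list[str]:
    """Pack whole sentences into chunks; repeat trailing sentences as overlap."""
    chunks: list[str] = []
    current: list[str] = []
    for sentence in split_sentences(text):
        length = sum(len(s.split()) for s in current) + len(sentence.split())
        if current and length > max_tokens:
            chunks.append(" ".join(current))
            # Carry the last sentence(s) forward so context spans the boundary.
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


def overlapping_chunks(text: str, size: int, overlap: int) -> list[str]:
    """Fixed-size chunks where each chunk repeats the previous `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    tokens = text.split()
    step = size - overlap
    chunks: list[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the final window already reaches the end of the text
    return chunks


paragraph = (
    "The performance coefficient is defined as the ratio of achieved KPIs "
    "to target KPIs over the review period per Schedule B. Compensation "
    "shall be calculated as follows: base salary multiplied by this "
    "coefficient, subject to a maximum of 3x the base rate."
)

# Sentence-aware chunks never end mid-sentence; the overlap repeats the definition.
for chunk in sentence_chunks(paragraph, max_tokens=30):
    print(repr(chunk))

# Token overlap of 3 on a 19-token window (roughly 16%).
for chunk in overlapping_chunks(paragraph, size=19, overlap=3):
    print(repr(chunk))
```

With a 30-word budget, the second sentence-aware chunk contains both the definition and the calculation, because the definition sentence is carried across the boundary. The token-overlap variant only repeats the last three words, which is a reminder that the overlap window has to be wide enough to cover the context a boundary cuts off.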

The fundamental paradox of chunk boundary failure is that better retrieval metrics will not expose it. Top-K metrics such as hit rate and recall@K measure whether the right chunk scored in the top K. If the right chunk is a fragment that scores 0.74, and no complete chunk exists because the document was split at indexing time, retrieval reports success. The failure was baked in during index construction, and no retrieval tuning, reranking, or prompt engineering can recover information that was never stored as a coherent unit.
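
A small sketch makes the blind spot concrete. It reuses the fragment texts and similarity scores from the example above rather than running an embedding model, and it assumes the evaluation set labels Chunk B as the relevant chunk. The helper `hit_at_k` and the identifiers `chunk_a` and `chunk_b` are hypothetical names for illustration.

```python
def hit_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> bool:
    """True if any relevant chunk id appears in the top k results."""
    return any(chunk_id in relevant_ids for chunk_id in ranked_ids[:k])


# Fragments left in the index by the fixed-size split (text from the example above).
index = {
    "chunk_a": (
        "The performance coefficient is defined as the ratio of achieved KPIs "
        "to target KPIs over the review period per"
    ),
    "chunk_b": (
        "Schedule B. Compensation shall be calculated as follows: base salary "
        "multiplied by this coefficient, subject to a maximum of 3x the base rate."
    ),
}
# Similarity scores copied from the example; no embedding model is run here.
scores = {"chunk_a": 0.71, "chunk_b": 0.74}
# Assumption: the evaluation set labels chunk_b as the relevant chunk.
relevant = {"chunk_b"}

ranked = sorted(scores, key=scores.get, reverse=True)
retrieved_text = index[ranked[0]]

print("hit@1:", hit_at_k(ranked, relevant, k=1))
print("definition present:", "is defined as" in retrieved_text)
```

The metric reports a hit while the retrieved text lacks the definition the user asked for, which is why the check has to happen on chunk content at indexing time rather than on ranking alone.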

Chunk boundary failures are invisible in retrieval metrics because the fragment is retrieved successfully — the actual failure happened at indexing time, when a semantically complete unit was split in half and neither fragment embeds well enough to reliably surface the full meaning a query needs.
