The Offline-Production Eval Gap: Why 91% RAGAS Doesn't Mean 91% User Satisfaction
Offline evals measure what you designed them to measure. Production measures what users actually need. The gap is where product quality lives — and most teams only discover it after shipping.
The number that doesn't predict what you think it predicts
A team builds a RAG system for customer support. They run RAGAS on a 500-question test set and get 0.91 faithfulness, 0.87 answer relevancy. They ship. Two weeks later: user satisfaction is at 58%. The support team is routing 40% of AI answers for human review. What happened?
What happened is the offline-production eval gap. Offline evals measure what the system does on the questions you anticipated. Production measures what the system does on the questions users actually ask. These are not the same distribution, they never will be, and the gap between them is where product quality lives.
Why offline eval distributions diverge from production
Test sets are built by product teams, engineers, or contractors who understand the system. They write clear, well-formed questions about the topics the system covers. Production users write ambiguous, misspelled, under-specified queries about topics that were never anticipated. They ask follow-up questions that assume context from a previous session. They paste in entire documents and ask 'what should I do with this?'
The divergence compounds over time. The longer your system is in production, the more the user population has evolved away from the test set authors' mental model of who the users are and what they need. A test set written in month one is a snapshot of anticipated use. Production by month six is something different.
If your test set was written by your team, it is already out of distribution relative to your users. This is not a fixable problem — it is a structural feature of the eval process. The correct response is not to write a better test set (though you should), but to add production eval signal alongside offline eval.
The signals that actually predict production quality
User behaviour is a better quality signal than automated metrics. Not the explicit signals (star ratings, thumbs up/down), which are noisy and sparse, but the implicit ones. Did the user ask a follow-up question immediately after the response? That suggests the answer was incomplete. Did the user rephrase and re-ask? That suggests the answer was wrong or off-topic. Did the user immediately navigate away? On its own that is ambiguous: the answer was either good enough or bad enough to end the session, so abandonment has to be read alongside the other signals.
Session abandonment, follow-up question rate, rephrase rate, and copy-to-clipboard rate (a copied answer was probably useful) are all stronger predictors of response quality than faithfulness or relevancy scores. They are also harder to instrument and interpret, which is why most teams skip them.
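As a rough illustration, the implicit signals above can be computed from session logs. This is a minimal sketch, not a reference implementation: the `Turn` record, the 60-second follow-up window, and the string-similarity test for "rephrase" (threshold 0.6) are all assumptions made for the example. A real system would likely detect rephrases with embeddings and tune the window to its own traffic.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass
class Turn:
    """One user query and what happened around it (hypothetical log schema)."""
    session_id: str
    query: str
    seconds_since_prev_response: Optional[float]  # None for the first turn
    copied_response: bool = False  # user copied the answer to this query


def is_rephrase(prev_query: str, query: str, threshold: float = 0.6) -> bool:
    """Assumed heuristic: a lexically similar next query counts as a rephrase."""
    ratio = SequenceMatcher(None, prev_query.lower(), query.lower()).ratio()
    return ratio >= threshold


def implicit_signal_rates(turns: list, followup_window_s: float = 60.0) -> dict:
    """Per-response rates. `turns` must be in time order within each session."""
    by_session: dict = {}
    for turn in turns:
        by_session.setdefault(turn.session_id, []).append(turn)

    counts = {"rephrase": 0, "follow_up": 0, "session_end": 0, "copy": 0}
    total = 0
    for session in by_session.values():
        for i, turn in enumerate(session):
            total += 1
            counts["copy"] += int(turn.copied_response)
            if i + 1 == len(session):
                # No further query: ambiguous on its own (satisfied or gave up).
                counts["session_end"] += 1
                continue
            nxt = session[i + 1]
            gap = nxt.seconds_since_prev_response
            if gap is None or gap > followup_window_s:
                continue  # a slow next query is not counted as a reaction
            if is_rephrase(turn.query, nxt.query):
                counts["rephrase"] += 1
            else:
                counts["follow_up"] += 1
    return {name: (n / total if total else 0.0) for name, n in counts.items()}
```

Tracked per week or per release, these rates give a trend line to put next to the offline scores. The session-end rate is kept separate because it cannot be interpreted without the others.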
Shadow eval: the bridge between offline and production
Shadow evaluation routes a sample of production queries through both the current model and a candidate model, without serving the candidate's responses to users. Both responses are then scored by an LLM judge or sampled for human review. The signal is real-distribution queries with controlled evaluation — not the clean test set, not raw production noise.
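The routing described above can be sketched in a few lines. Everything named here is hypothetical: `Model` stands in for whatever client calls your models, and `shadow_log` for whatever store feeds the judge. The point of the sketch is the invariant that the user only ever sees the current model's answer, even when the candidate fails.

```python
from typing import Callable, Optional

# Hypothetical interface: a model takes a query string and returns an answer.
Model = Callable[[str], str]


def handle_query(
    query: str,
    current: Model,
    candidate: Model,
    shadow_log: list,
    shadowed: bool,
) -> str:
    """Serve the current model; optionally record the candidate's answer too."""
    served = current(query)
    if shadowed:
        shadow_answer: Optional[str] = None
        error: Optional[str] = None
        try:
            # In a real service this call would run off the request path
            # (queue or background task) so it cannot add user-facing latency.
            shadow_answer = candidate(query)
        except Exception as exc:
            # A failing candidate must never affect the served response.
            error = repr(exc)
        shadow_log.append(
            {
                "query": query,
                "current": served,
                "candidate": shadow_answer,
                "candidate_error": error,
            }
        )
    return served
```

Each logged record holds both answers to the same real query, which is exactly the pair an LLM judge or a human reviewer needs to compare.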
Shadow eval is the highest-quality pre-deployment signal available for LLM systems. It catches distribution shift (does the candidate handle the new query types that appeared in production this month?), regression on edge cases (the queries that the offline test set never covered), and quality improvements on the long tail of real queries.
Route 1–5% of production traffic to shadow eval continuously. Score it with an LLM judge on a daily sample of 100–200 queries. This gives you real-distribution signal without waiting for a full eval run. When the shadow score and the offline eval score diverge, trust the shadow score: it is measured on the queries users actually send.
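One way to pick the shadow slice and the daily judge sample is sketched below. The function names, the 2% default rate, and the sample size of 150 are assumptions chosen to sit inside the ranges above. Hashing a stable query ID, rather than calling a random number generator per request, keeps the routing decision reproducible.

```python
import hashlib
import random


def in_shadow_sample(query_id: str, rate: float = 0.02) -> bool:
    """Deterministically route roughly `rate` of queries to shadow eval."""
    digest = hashlib.sha256(query_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def daily_judge_sample(shadow_records: list, k: int = 150, seed: int = 0) -> list:
    """Pick up to `k` shadow records for the judge; seeded for reproducibility."""
    if len(shadow_records) <= k:
        return list(shadow_records)
    return random.Random(seed).sample(shadow_records, k)
```

Using the date as the seed (for example `seed=20240115`, an arbitrary illustrative value) would let anyone re-draw the same daily sample when auditing a judge score.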
Closing the gap systematically
- Seed your test set from production logs: monthly, sample 100 real production queries, add them to the test set after labelling. The test set should grow toward the real distribution, not remain a static snapshot.
- Track production-to-offline correlation: measure the correlation between offline eval scores and production quality signals (follow-up rate, rephrase rate) over time. If correlation drops, the test set has drifted from production.
- Build failure taxonomy from production: when a production response fails (human override, user rephrase), log the failure mode. Build eval cases specifically targeting those failure modes.
- Never ship based on offline eval alone: treat offline eval as a necessary gate, not a sufficient one. Require shadow eval signal or A/B test signal before declaring any model change an improvement.
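The correlation check in the second item above might look like the following sketch. The per-release numbers in the usage example are made up for illustration, and the 0.5 cutoff is an assumed threshold, not a recommended value. Because a higher offline score should go with a lower rephrase rate, a healthy test set shows a strongly negative correlation.

```python
import math


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation coefficient; NaN if either series is constant."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def offline_tracks_production(
    offline_scores: list, rephrase_rates: list, min_abs_corr: float = 0.5
) -> bool:
    """True if better offline scores still go with fewer rephrases."""
    r = pearson(offline_scores, rephrase_rates)
    return (not math.isnan(r)) and r <= -min_abs_corr


# Hypothetical per-release data, for illustration only.
offline = [0.80, 0.85, 0.90, 0.95]
healthy_rephrase = [0.30, 0.25, 0.20, 0.15]  # falls as offline score rises
drifted_rephrase = [0.15, 0.20, 0.25, 0.30]  # rises as offline score rises

print(offline_tracks_production(offline, healthy_rephrase))
print(offline_tracks_production(offline, drifted_rephrase))
```

With only a handful of releases the coefficient is noisy, so a check like this is better treated as a prompt to inspect the test set than as an automatic gate.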