Shadow Mode Testing: How to Compare Models Before You Switch
Run a new model in shadow mode alongside your live model, compare outputs without user exposure, and make data-driven upgrade decisions.
Here's the professional way to change your production model: you don't. Not yet. First, you run it in shadow mode — the new model sees every request, generates a response, but that response never reaches your users. You collect data. You compare. Then you decide.
Shadow mode testing is the responsible adult in the room when everyone else wants to just ship it and see what happens. It addresses one of the biggest risks in model upgrades: that your eval set doesn't represent the long tail of real production queries.
The architecture
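import asyncio
from datetime import datetime, timezone

# Hold strong references to background tasks so they aren't garbage-collected mid-flight
_background_tasks = set()

async def handle_request(query, context):
    # Primary: always serves the response
    primary_task = asyncio.create_task(
        call_model(PRODUCTION_MODEL, query, context)
    )
    # Shadow: runs in parallel, response is logged but NEVER returned to user
    shadow_task = asyncio.create_task(
        call_model(SHADOW_MODEL, query, context)
    )
    primary_response = await primary_task
    # Don't await shadow in the critical path: run the comparison in the background
    comparison_task = asyncio.create_task(
        log_shadow_comparison(query, primary_response, shadow_task)
    )
    _background_tasks.add(comparison_task)
    comparison_task.add_done_callback(_background_tasks.discard)
    return primary_response  # Only primary reaches the user

async def log_shadow_comparison(query, primary_response, shadow_task):
    try:
        # wait_for cancels the shadow task if it exceeds the timeout
        shadow_response = await asyncio.wait_for(shadow_task, timeout=30)
        await store_comparison({
            "query": query,
            "primary": primary_response,
            "shadow": shadow_response,
            "timestamp": datetime.now(timezone.utc).isoformat()
        })
    except asyncio.TimeoutError:
        log_metric("shadow_timeout")
    except Exception:
        # Shadow-path failures must never affect production; count them instead
        # (the metric name here is illustrative)
        log_metric("shadow_error")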
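To see why the shadow call stays out of the critical path, here is a minimal, self-contained sketch of the same pattern with stand-in models. Everything in it is hypothetical: `fake_model`, `log_shadow`, `handle`, and the sleep durations exist only to simulate a fast primary and a slower shadow, and the `events` list records the order in which things happen.

```python
import asyncio

async def fake_model(name, delay):
    # Stand-in for a real model call; the delay simulates inference time
    await asyncio.sleep(delay)
    return f"{name} response"

async def log_shadow(shadow_task, events):
    # Background work: waits for the shadow result, then records it
    await shadow_task
    events.append("shadow_logged")

async def handle(query, events):
    primary = asyncio.create_task(fake_model("primary", 0.01))
    shadow = asyncio.create_task(fake_model("shadow", 0.05))
    response = await primary
    # Schedule the comparison without awaiting it
    logger = asyncio.create_task(log_shadow(shadow, events))
    events.append("primary_returned")
    return response, logger

async def main():
    events = []
    response, logger = await handle("hello", events)
    await logger  # awaited here only so the demo exits cleanly
    return response, events
```

Running `asyncio.run(main())` should show the primary response being returned before the shadow result is logged, even though the simulated shadow model is five times slower.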
What to measure
- Pairwise preference: LLM judge decides which response is better for a random 10% sample
- Semantic divergence: cosine similarity between embeddings of the primary and shadow responses; low similarity (high divergence) needs manual review
- Length distribution: if shadow responses are dramatically longer or shorter, worth investigating
- Tool call alignment: do both models call the same tools with the same arguments on agentic tasks?
- Failure rate: does the shadow model time out, error, or produce malformed output more often?
- Latency: would the shadow model's P99 meet your SLA if it were production?
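Several of the metrics above reduce to a few lines of arithmetic. The following is a rough sketch using only the standard library, not a reference implementation. It assumes embeddings are computed elsewhere and arrive as plain lists of floats, and that tool calls are logged as dicts with `name` and `args` keys; both shapes are assumptions for illustration, as are the function names.

```python
import json
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def length_ratio(primary_text, shadow_text):
    """Shadow length relative to primary, in characters (1.0 means equal)."""
    return len(shadow_text) / max(len(primary_text), 1)

def tool_calls_match(primary_calls, shadow_calls):
    """True when both models made the same calls, in order, with equal arguments."""
    def normalize(calls):
        # Serialize args with sorted keys so key order doesn't cause false mismatches
        return [(c["name"], json.dumps(c["args"], sort_keys=True)) for c in calls]
    return normalize(primary_calls) == normalize(shadow_calls)

def p99(latencies_ms):
    """99th percentile by the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n)
    return ordered[rank - 1]
```

Exact-match tool comparison is deliberately strict. If your tools accept equivalent arguments in different forms, you would likely want a looser, tool-specific comparison.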
When to graduate from shadow to production
After 7–14 days of shadow data (or once your pairwise preference score reaches statistical significance), you have a real answer. The threshold I'd recommend has three parts: the shadow model wins on pairwise preference at p < 0.05 (or, if parity is good enough for you, is not significantly worse than the primary), its error rate is lower or equal, and it meets the latency SLA. If it passes all three, it earns a canary deployment (5% traffic). Then 25%. Then 100%.
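One way to turn that threshold into a check is an exact sign test on the judge's verdicts, with ties dropped. This is a sketch under assumptions, not the only valid test: the function names, the `accept_tie` flag, and the choice of a one-sided binomial test are mine, and you may prefer a different statistic for your data.

```python
from math import comb

def sign_test_p_value(wins, losses):
    """One-sided exact binomial p-value for 'wins' beating 'losses'; ties are excluded."""
    n = wins + losses
    if n == 0:
        return 1.0
    # P(X >= wins) for X ~ Binomial(n, 0.5)
    tail = sum(comb(n, k) for k in range(wins, n + 1))
    return tail / 2 ** n

def ready_for_canary(shadow_wins, primary_wins,
                     shadow_error_rate, primary_error_rate,
                     shadow_p99_ms, sla_p99_ms,
                     alpha=0.05, accept_tie=False):
    """Return (passed, per-check results) for the three graduation criteria."""
    shadow_better = sign_test_p_value(shadow_wins, primary_wins) < alpha
    primary_better = sign_test_p_value(primary_wins, shadow_wins) < alpha
    checks = {
        # A "tie" here means the primary is not significantly preferred
        "preference": shadow_better or (accept_tie and not primary_better),
        "error_rate": shadow_error_rate <= primary_error_rate,
        "latency": shadow_p99_ms <= sla_p99_ms,
    }
    return all(checks.values()), checks
```

For example, 9 shadow wins against 1 primary win gives p ≈ 0.011 and passes the preference check, while 8 against 2 gives p ≈ 0.055 and does not. Note that with `accept_tie=True` a small sample will pass almost trivially, so it is worth pairing that option with a minimum sample size.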
Run shadow tests continuously, not just during planned upgrades. When a new model version is released, spin up a shadow run immediately. By the time you're ready to evaluate switching, you already have two weeks of production data.