
Blue-Green, Canary, Shadow, Champion-Challenger: ML Deployment Patterns Explained

Four patterns for deploying models without causing incidents. When to use each. The user-consistent canary routing trick, shadow mode for zero-risk validation, and champion-challenger for permanent experimentation. With implementation code.

Four Patterns for Deploying Models to Production

The naive deployment pattern is: train model, replace the running model, watch metrics. This is how teams cause incidents. The right question before any deployment is: what's the rollback path if the new model is wrong?

There are four patterns worth knowing. Each trades risk against feedback speed and infrastructure complexity.

Pattern 1: Blue-Green Deployment

Two identical environments. Blue serves 100% of traffic. Green runs the new model and is validated in isolation. When green passes smoke tests, you flip the load balancer: green becomes the live environment, blue is on standby for instant rollback.

# Nginx config: blue-green toggle via upstream block
upstream model_blue  { server blue-model:8080; }
upstream model_green { server green-model:8080; }

# To flip: change the proxy_pass target, then reload the config (nginx -s reload)
server {
    location /predict {
        proxy_pass http://model_green;  # change to model_blue to roll back
    }
}

Strengths: instant rollback (sub-second load balancer flip), clean separation between environments, easy to automate. Weakness: it doubles infrastructure cost for as long as both environments are running.
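
The "green passes smoke tests" gate can be sketched as a small check that runs before the flip. This is a minimal illustration, not a full validation suite: the function name passes_smoke_tests, the min_pass_rate parameter, and the assumption that a prediction is a dict with a "label" key are all hypothetical.

```python
from typing import Callable, Iterable, Tuple

def passes_smoke_tests(
    predict: Callable[[dict], dict],
    cases: Iterable[Tuple[dict, str]],
    min_pass_rate: float = 1.0,
) -> bool:
    """Return True if green's predictions match the expected labels often enough.

    `predict` is assumed to call the green environment and return a dict
    containing a "label" key (a hypothetical response shape).
    """
    total = 0
    passed = 0
    for features, expected_label in cases:
        total += 1
        try:
            if predict(features).get("label") == expected_label:
                passed += 1
        except Exception:
            # A crash on a smoke-test input counts as a failed case.
            pass
    # An empty test set proves nothing, so it does not pass the gate.
    return total > 0 and passed / total >= min_pass_rate
```

Only flip the load balancer when this returns True. Keeping the gate separate from the flip also makes the whole step easy to automate.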

Pattern 2: Canary Deployment

Route a small percentage of traffic — 1%, 5%, 10% — to the new model. Monitor business metrics on the canary slice. Gradually ramp up if metrics hold. Kill it immediately if they degrade.

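# Python: deterministic canary routing by user ID
import hashlib

BUCKETS = 10_000  # 0.01% granularity, so a 0.5% canary is expressible

def route_to_canary(user_id: str, canary_pct: float = 0.05) -> bool:
    """Deterministic: same user always hits same model during rollout."""
    # md5 is used here only for stable bucketing, not for security
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % BUCKETS
    # Compare integers: canary_pct * 100 can pick up float error
    # (0.07 * 100 == 7.000000000000001), which would admit an extra bucket
    return bucket < round(canary_pct * BUCKETS)

# At inference time (new_model, stable_model and log_event are defined elsewhere)
def predict(user_id: str, features: dict) -> dict:
    if route_to_canary(user_id):
        result = new_model.predict(features)
        log_event("canary_prediction", user_id=user_id, result=result)
    else:
        result = stable_model.predict(features)
    return result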

The crucial detail: canary routing must be user-consistent, not request-random. If a user hits the old model for one request and the new model for the next, you pollute both signals and introduce UX inconsistency. Hash on user_id, not on request_id.
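
The "ramp up if metrics hold, kill if they degrade" rule can be written as a small decision function. The sketch below is illustrative only: the ramp schedule, the 2% tolerance, and the name next_canary_pct are assumptions, and it assumes a single business metric where higher is better (for example, conversion rate).

```python
# Hypothetical ramp schedule: traffic share at each rollout stage.
RAMP_STEPS = [0.01, 0.05, 0.10, 0.25, 0.50, 1.00]

def next_canary_pct(
    current_pct: float,
    canary_metric: float,
    baseline_metric: float,
    max_relative_drop: float = 0.02,
) -> float:
    """Return the next canary share: 0.0 to kill it, otherwise the next step up.

    Assumes higher metric values are better and baseline_metric is positive.
    """
    if baseline_metric <= 0:
        raise ValueError("baseline_metric must be positive")
    relative_drop = (baseline_metric - canary_metric) / baseline_metric
    if relative_drop > max_relative_drop:
        return 0.0  # degrade detected: route everyone back to the stable model
    for step in RAMP_STEPS:
        if step > current_pct:
            return step
    return 1.0  # already at full rollout
```

In practice you would also wait until the canary slice has enough traffic for the comparison to mean something before calling this; that check is left out of the sketch.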

Pattern 3: Shadow Mode

The new model runs alongside the production model and receives every request, but its predictions are only logged, never served. Only the production model's output reaches the user. This lets you compare the two models on real production inputs without any user exposure.

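import logging
from concurrent.futures import ThreadPoolExecutor

# Shared pool caps concurrent shadow threads (a thread per request grows
# without limit under load). max_workers=4 is a placeholder; tune to your load.
_shadow_pool = ThreadPoolExecutor(max_workers=4)

def shadow_predict(features: dict, user_id: str) -> dict:
    # Serve production model result synchronously
    prod_result = production_model.predict(features)

    # Run new model in the background: log for comparison, never serve it
    def _shadow():
        try:
            shadow_result = new_model.predict(features)
            log_comparison({
                "user_id": user_id,
                "prod_score": prod_result["score"],
                "shadow_score": shadow_result["score"],
                "agreed": prod_result["label"] == shadow_result["label"]
            })
        except Exception:
            # A failing shadow model must never affect the served request
            logging.exception("shadow prediction failed")
    _shadow_pool.submit(_shadow)

    return prod_result  # user never sees shadow result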
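
Once comparison records are collected, they need to be reduced to something you can act on. Here is a minimal sketch that assumes the records have the same keys passed to log_comparison above; the function name summarize_shadow_log and the choice of summary statistics are illustrative, not prescribed.

```python
from statistics import mean
from typing import Dict, List

def summarize_shadow_log(records: List[dict]) -> Dict[str, float]:
    """Summarize shadow-vs-production comparison records.

    Each record is assumed to carry "prod_score", "shadow_score" and "agreed".
    """
    if not records:
        raise ValueError("no shadow records to summarize")
    return {
        "n": len(records),
        # Share of requests where both models produced the same label.
        "agreement_rate": mean(1.0 if r["agreed"] else 0.0 for r in records),
        # Average size of the score gap, regardless of direction.
        "mean_abs_score_diff": mean(
            abs(r["prod_score"] - r["shadow_score"]) for r in records
        ),
    }
```

A low agreement rate is not automatically bad, since the new model is supposed to differ somewhere. It does tell you which requests to inspect by hand before any user is exposed.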

Pattern 4: Champion-Challenger

An ongoing A/B test framework where the production model (champion) always serves the majority of traffic, and one or more challenger models receive a fixed minority slice. Unlike canary, this is a permanent split used to continuously evaluate candidates before promoting any to champion.
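
The fixed split can reuse the same user-consistent hashing idea as canary routing, extended to several named slices. This is a sketch under assumptions: the challenger names, the 5% shares, and the function assign_model are hypothetical, and md5 is used only for stable bucketing.

```python
import hashlib

# Hypothetical split: each challenger gets a fixed slice, the champion gets the rest.
CHALLENGER_SLICES = {"challenger_a": 0.05, "challenger_b": 0.05}

def assign_model(user_id: str, slices: dict = CHALLENGER_SLICES,
                 buckets: int = 10_000) -> str:
    """Map a user to "champion" or one challenger, always the same one per user."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % buckets
    upper = 0
    for name, share in slices.items():
        # Each challenger owns a contiguous range of integer buckets.
        upper += round(share * buckets)
        if bucket < upper:
            return name
    return "champion"
```

Because assignment depends only on user_id and the slice table, swapping in a new challenger under an existing slice keeps every other user's assignment unchanged.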

When to Use Which

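Use blue-green when you need an instant rollback path and can afford a second environment. Use canary when you want to measure business metrics on real users while limiting exposure. Use shadow mode when you need validation on production inputs with no user exposure at all. Use champion-challenger when you want a permanent slice of traffic for evaluating candidates.

The most common mistake: deploying a model with no rollback mechanism at all. Blue-green is not 'gold standard overhead'; it's the minimum responsible deployment for a model in a user-facing path.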
