ML CI/CD: Testing, Versioning, and Deploying LLM Pipelines
How to adapt software CI/CD for ML: prompt versioning, eval regression gates, canary deploys, and rollback for model updates.
Software engineers have CI/CD. They push code, tests run, and bad changes never reach production automatically. ML teams should have the same — but most don't, because 'testing a model' isn't the same as 'testing a function.' This post is about building the CI/CD pipeline that makes ML deployments safe.
What 'testing' means in an ML pipeline
| Test type | What it catches | When to run |
|---|---|---|
| Data validation | Schema drift, missing values, distribution shift in training data | Before every training run |
| Training smoke test | Code bugs in training loop — crashes before epoch 1 completes | On every PR to training code |
| Eval suite | Quality regression vs. previous model version | After every training run, before any deployment |
| Serving tests | Model loads correctly, returns valid output, meets latency SLA | Before every serving deployment |
| Shadow comparison | New model vs. production model on real traffic | Before promoting to production |
| Canary health check | Error rate, latency, and quality signals on 5% traffic | During canary rollout |
The ML CI/CD pipeline
name: ML Pipeline
on:
push:
paths: ['src/model/**', 'prompts/**', 'data/schemas/**']
jobs:
data-validation:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Validate training data schema
run: python scripts/validate_data.py --schema data/schema.json
eval-gate:
needs: data-validation
runs-on: ubuntu-latest
steps:
- name: Run eval suite against current changes
run: python scripts/run_evals.py --baseline main --candidate HEAD
- name: Check pass rate threshold
run: |
PASS_RATE=$(cat eval_results.json | jq '.pass_rate')
if (( $(echo "$PASS_RATE < 0.85" | bc -l) )); then
echo "Eval pass rate $PASS_RATE below threshold 0.85 — blocking deployment"
exit 1
fi
serving-test:
needs: eval-gate
runs-on: ubuntu-latest
steps:
- name: Deploy to staging
run: ./scripts/deploy_staging.sh
- name: Run latency and smoke tests
run: python scripts/serving_tests.py --env staging --max-p99-ms 3000
Prompt versioning as code
For LLM-heavy pipelines, prompts are the model. A prompt change is as significant as a weight change. Treat it that way: prompts in version control, semantic versioning (1.2.0 → 1.2.1 for wording tweaks, 1.3.0 for structural changes), eval suite runs on every prompt PR, and a clear rollback path.
# prompts/rag_answer_v1.2.0.txt is checked into git
# Registry loads by version, falling back to latest
class PromptRegistry:
def get(self, name: str, version: str = "latest") -> str:
if version == "latest":
version = self._get_latest_version(name)
path = f"prompts/{name}_v{version}.txt"
return open(path).read()
def promote(self, name: str, from_version: str, to_env: str):
"""Promote a prompt version to an environment after eval gate passes"""
self._run_eval_gate(name, from_version) # raises if fails
self._write_env_config(name, from_version, to_env)
Build an ML CI/CD pipeline →: Configure eval gates and deployment automation in the Systems module.
Try it interactively
GenAI Systems Lab is a free platform for AI engineers — configure real failure modes, break things, and build the judgment that gets you hired.
Open GenAI Systems Lab →