GenAI Systems Lab Open interactive version →
AI Engineering 10 min read

ML CI/CD: Testing, Versioning, and Deploying LLM Pipelines

How to adapt software CI/CD for ML: prompt versioning, eval regression gates, canary deploys, and rollback for model updates.

Software engineers have CI/CD. They push code, tests run, and bad changes never reach production automatically. ML teams should have the same — but most don't, because 'testing a model' isn't the same as 'testing a function.' This post is about building the CI/CD pipeline that makes ML deployments safe.

What 'testing' means in an ML pipeline

Test typeWhat it catchesWhen to run
Data validationSchema drift, missing values, distribution shift in training dataBefore every training run
Training smoke testCode bugs in training loop — crashes before epoch 1 completesOn every PR to training code
Eval suiteQuality regression vs. previous model versionAfter every training run, before any deployment
Serving testsModel loads correctly, returns valid output, meets latency SLABefore every serving deployment
Shadow comparisonNew model vs. production model on real trafficBefore promoting to production
Canary health checkError rate, latency, and quality signals on 5% trafficDuring canary rollout

The ML CI/CD pipeline

name: ML Pipeline

on:
  push:
    paths: ['src/model/**', 'prompts/**', 'data/schemas/**']

jobs:
  data-validation:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Validate training data schema
        run: python scripts/validate_data.py --schema data/schema.json

  eval-gate:
    needs: data-validation
    runs-on: ubuntu-latest
    steps:
      - name: Run eval suite against current changes
        run: python scripts/run_evals.py --baseline main --candidate HEAD
      - name: Check pass rate threshold
        run: |
          PASS_RATE=$(cat eval_results.json | jq '.pass_rate')
          if (( $(echo "$PASS_RATE < 0.85" | bc -l) )); then
            echo "Eval pass rate $PASS_RATE below threshold 0.85 — blocking deployment"
            exit 1
          fi

  serving-test:
    needs: eval-gate
    runs-on: ubuntu-latest
    steps:
      - name: Deploy to staging
        run: ./scripts/deploy_staging.sh
      - name: Run latency and smoke tests
        run: python scripts/serving_tests.py --env staging --max-p99-ms 3000

Prompt versioning as code

For LLM-heavy pipelines, prompts are the model. A prompt change is as significant as a weight change. Treat it that way: prompts in version control, semantic versioning (1.2.0 → 1.2.1 for wording tweaks, 1.3.0 for structural changes), eval suite runs on every prompt PR, and a clear rollback path.

# prompts/rag_answer_v1.2.0.txt is checked into git
# Registry loads by version, falling back to latest

class PromptRegistry:
    def get(self, name: str, version: str = "latest") -> str:
        if version == "latest":
            version = self._get_latest_version(name)
        path = f"prompts/{name}_v{version}.txt"
        return open(path).read()

    def promote(self, name: str, from_version: str, to_env: str):
        """Promote a prompt version to an environment after eval gate passes"""
        self._run_eval_gate(name, from_version)  # raises if fails
        self._write_env_config(name, from_version, to_env)

Build an ML CI/CD pipeline →: Configure eval gates and deployment automation in the Systems module.

Try it interactively

GenAI Systems Lab is a free platform for AI engineers — configure real failure modes, break things, and build the judgment that gets you hired.

Open GenAI Systems Lab →