
ML CI/CD: Testing, Versioning, and Deploying LLM Pipelines

How to adapt software CI/CD for ML: prompt versioning, eval regression gates, canary deploys, and rollback for model updates.

Software engineers have CI/CD: they push code, tests run, and bad changes are blocked automatically before they reach production. ML teams should have the same, but most don't, because 'testing a model' isn't the same as 'testing a function.' This post covers how to build a CI/CD pipeline that makes ML deployments safe.

What 'testing' means in an ML pipeline

Test type	What it catches	When to run
Data validation	Schema drift, missing values, distribution shift in training data	Before every training run
Training smoke test	Code bugs in the training loop, such as crashes before epoch 1 completes	On every PR to training code
Eval suite	Quality regression vs. previous model version	After every training run, before any deployment
Serving tests	Model loads correctly, returns valid output, meets latency SLA	Before every serving deployment
Shadow comparison	New model vs. production model on real traffic	Before promoting to production
Canary health check	Error rate, latency, and quality signals on 5% of traffic	During canary rollout
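
The canary row in the table above can be read as a three-way decision: keep waiting, roll back, or promote. The sketch below is one possible way to encode that, not a prescribed implementation. The `WindowStats` type, the `canary_decision` function, and every threshold value are hypothetical names and numbers chosen for illustration; real limits should come from your own SLAs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowStats:
    """Aggregated metrics for one deployment over an observation window."""
    requests: int
    errors: int
    p99_latency_ms: float
    quality_score: float  # assumed: mean online quality signal in [0, 1]

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(
    baseline: WindowStats,
    canary: WindowStats,
    min_requests: int = 500,             # hypothetical: too little traffic to judge
    max_error_rate_delta: float = 0.01,  # hypothetical: absolute increase allowed
    max_latency_ratio: float = 1.2,      # hypothetical: canary p99 vs. baseline p99
    max_quality_drop: float = 0.02,      # hypothetical: absolute drop allowed
) -> str:
    """Return 'wait', 'rollback', or 'promote' for the canary deployment."""
    if canary.requests < min_requests:
        return "wait"
    if canary.error_rate - baseline.error_rate > max_error_rate_delta:
        return "rollback"
    if canary.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio:
        return "rollback"
    if baseline.quality_score - canary.quality_score > max_quality_drop:
        return "rollback"
    return "promote"


# Production keeps 95% of traffic; the canary receives 5%.
baseline = WindowStats(requests=19_000, errors=95, p99_latency_ms=2100.0, quality_score=0.91)
healthy = WindowStats(requests=1_000, errors=6, p99_latency_ms=2200.0, quality_score=0.90)
slow = WindowStats(requests=1_000, errors=5, p99_latency_ms=2900.0, quality_score=0.91)

print(canary_decision(baseline, healthy))  # promote
print(canary_decision(baseline, slow))     # rollback
```

Comparing the canary against the live baseline over the same window, rather than against fixed limits, keeps a traffic-wide incident from being blamed on the new model.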

The ML CI/CD pipeline

name: ML Pipeline

on:
  push:
    paths: ['src/model/**', 'prompts/**', 'data/schemas/**']

jobs:
  data-validation:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate training data schema
        run: python scripts/validate_data.py --schema data/schema.json

  eval-gate:
    needs: data-validation
    runs-on: ubuntu-latest
    steps:
      # Each job starts on a fresh runner, so the repo must be checked out again
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # full history, in case run_evals.py resolves the baseline ref from git
      - name: Run eval suite against current changes
        run: python scripts/run_evals.py --baseline main --candidate HEAD
      - name: Check pass rate threshold
        run: |
          PASS_RATE=$(jq '.pass_rate' eval_results.json)
          if (( $(echo "$PASS_RATE < 0.85" | bc -l) )); then
            echo "Eval pass rate $PASS_RATE below threshold 0.85, blocking deployment"
            exit 1
          fi

  serving-test:
    needs: eval-gate
    runs-on: ubuntu-latest
    steps:
      # The deploy and test scripts live in the repo, so this job needs its own checkout
      - uses: actions/checkout@v4
      - name: Deploy to staging
        run: ./scripts/deploy_staging.sh
      - name: Run latency and smoke tests
        run: python scripts/serving_tests.py --env staging --max-p99-ms 3000
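
The threshold step in the workflow checks an absolute pass rate, while the table describes the eval suite as catching regressions against the previous version. A gate can check both. The following is a minimal sketch under stated assumptions: the function name `eval_gate_failures` and the `max_regression` tolerance are hypothetical, and the contents of `scripts/run_evals.py` are not shown in this post, so this is not a reconstruction of it.

```python
def eval_gate_failures(
    candidate_pass_rate: float,
    baseline_pass_rate: float,
    min_pass_rate: float = 0.85,   # same absolute threshold as the workflow step
    max_regression: float = 0.02,  # hypothetical tolerance for run-to-run noise
) -> list[str]:
    """Return the reasons the gate fails; an empty list means the gate passes."""
    failures = []
    if candidate_pass_rate < min_pass_rate:
        failures.append(
            f"pass rate {candidate_pass_rate:.3f} is below the floor {min_pass_rate:.3f}"
        )
    if baseline_pass_rate - candidate_pass_rate > max_regression:
        failures.append(
            f"pass rate dropped from {baseline_pass_rate:.3f} to {candidate_pass_rate:.3f}"
        )
    return failures


# Clears the 0.85 floor but regresses by five points: blocked by the second check only.
regressed = eval_gate_failures(candidate_pass_rate=0.88, baseline_pass_rate=0.93)
# Within tolerance of the baseline and above the floor: passes.
stable = eval_gate_failures(candidate_pass_rate=0.90, baseline_pass_rate=0.91)

for reason in regressed:
    print("BLOCKED:", reason)
print("stable candidate passes:", not stable)
```

The first example is the case an absolute threshold alone would miss: the candidate is still above 0.85, yet clearly worse than what is already in production.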

Prompt versioning as code

For LLM-heavy pipelines, prompts are the model. A prompt change is as significant as a weight change. Treat it that way: prompts in version control, semantic versioning (1.2.0 → 1.2.1 for wording tweaks, 1.3.0 for structural changes), eval suite runs on every prompt PR, and a clear rollback path.

# prompts/rag_answer_v1.2.0.txt is checked into git
# Registry loads by version, defaulting to latest

from pathlib import Path

class PromptRegistry:
    def get(self, name: str, version: str = "latest") -> str:
        if version == "latest":
            version = self._get_latest_version(name)
        # read_text opens and closes the file, so no handle is leaked
        return Path(f"prompts/{name}_v{version}.txt").read_text(encoding="utf-8")

    def promote(self, name: str, from_version: str, to_env: str) -> None:
        """Promote a prompt version to an environment after the eval gate passes."""
        self._run_eval_gate(name, from_version)  # raises if the gate fails
        self._write_env_config(name, from_version, to_env)

    # _get_latest_version, _run_eval_gate, and _write_env_config are not shown here
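
The registry relies on a `_get_latest_version` helper that is not shown. One plausible way to implement it, assuming the `{name}_v{major}.{minor}.{patch}.txt` naming from the example path, is to parse the version out of each filename and compare the parts as integers. The standalone `latest_version` function and the temporary directory below are illustrative only.

```python
import re
import tempfile
from pathlib import Path

# Matches files such as rag_answer_v1.2.0.txt
VERSIONED_NAME = re.compile(
    r"^(?P<name>.+)_v(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)\.txt$"
)


def latest_version(prompt_dir: Path, name: str) -> str:
    """Return the highest semantic version on disk for the prompt `name`."""
    versions = []
    for path in prompt_dir.glob(f"{name}_v*.txt"):
        match = VERSIONED_NAME.match(path.name)
        if match and match["name"] == name:
            # Compare as integer tuples so that 1.10.0 sorts above 1.2.1
            versions.append(
                tuple(int(match[part]) for part in ("major", "minor", "patch"))
            )
    if not versions:
        raise FileNotFoundError(f"no versions of prompt {name!r} in {prompt_dir}")
    return ".".join(str(part) for part in max(versions))


# Demo against a throwaway directory rather than the real prompts/ folder
with tempfile.TemporaryDirectory() as tmp:
    prompt_dir = Path(tmp)
    for version in ("1.2.0", "1.2.1", "1.10.0"):
        path = prompt_dir / f"rag_answer_v{version}.txt"
        path.write_text(f"prompt body for {version}", encoding="utf-8")
    resolved = latest_version(prompt_dir, "rag_answer")

print(resolved)  # 1.10.0
```

Sorting the filenames as plain strings would pick 1.2.1 here, because "1.2" sorts after "1.1" character by character. Parsing to integers avoids silently serving an older prompt once a minor or patch number reaches double digits.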

