
ML Serving Containers: Patterns for Docker, GPU, and Production-Grade FastAPI

Multi-stage Docker builds for ML, model weights as volume mounts vs. baked artifacts, CUDA version pinning, health/readiness probes for Kubernetes, and the image size checklist. Why ML containers break differently from web containers.

Why ML Containers Break Differently From Web Containers

Dockerizing a Django app is solved. Dockerizing an ML serving stack is harder: GPU drivers must match between host and container, large model weights balloon image size, Python dependency hell intersects with CUDA version pinning, and you often need to serve the model with different runtime libraries than you used to train it.

The patterns below are specifically for ML serving — model inference in production containers, not training jobs.

Pattern 1: Multi-Stage Build for Model Serving

# Dockerfile: multi-stage build for a PyTorch serving container
# Stage 1: dependency builder (build tooling and pip's temporary files stay out of the final image)
FROM python:3.11-slim AS builder
WORKDIR /build

COPY requirements.txt .
RUN pip install --no-cache-dir --target /build/packages -r requirements.txt

# Stage 2: minimal runtime image
FROM python:3.11-slim AS runtime
WORKDIR /app

# Copy only installed packages, not pip cache
COPY --from=builder /build/packages /usr/local/lib/python3.11/site-packages/

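# Copy model artefact (baked into the image: simpler for immutable deploys)
COPY models/churn_v7.pkl /app/models/churn_v7.pkl
# Point the serving code at the baked artefact (src/serve.py reads MODEL_PATH)
ENV MODEL_PATH=/app/models/churn_v7.pkl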

# Copy application code
COPY src/ /app/src/

# Non-root user for security
RUN useradd --no-create-home --shell /bin/false appuser
USER appuser

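EXPOSE 8080
# pip install --target puts console scripts under the target's bin/ directory,
# which is not on PATH in this stage, so start uvicorn as a module instead
CMD ["python", "-m", "uvicorn", "src.serve:app", "--host", "0.0.0.0", "--port", "8080", "--workers", "2"]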
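Copying packages between stages can fail quietly, for example when a compiled extension needs a system library that the slim runtime image lacks. A minimal sketch of an import smoke check follows; the file name check_imports.py and the REQUIRED_MODULES list are assumptions for illustration, not part of the original project.

```python
# check_imports.py (hypothetical helper, not part of the original project)
import importlib

# Assumed top-level modules the serving code needs; adjust to your requirements.txt
REQUIRED_MODULES = ["fastapi", "uvicorn"]

def missing_modules(names):
    """Return the names from `names` that fail to import in this interpreter."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            # Covers both absent packages and extensions whose shared libraries are missing
            missing.append(name)
    return missing

def main():
    """Return a process exit code: 0 if everything imports, 1 otherwise."""
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("missing modules: " + ", ".join(missing))
        return 1
    print("all required modules importable")
    return 0
```

Calling sys.exit(main()) from a RUN step at the end of the runtime stage would fail the build, rather than the first request, when a package did not survive the copy.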

Pattern 2: Model Weights as a Volume Mount

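Baking weights into the image ties each model version to an immutable image tag, but it bloats the image (a BERT-base checkpoint is roughly 400–450MB; GPT-style models are multi-GB). The alternative is to store weights in S3/GCS and download them at container startup, or to mount them as a persistent volume. Downloading at startup keeps the image small at the cost of a slower cold start; a mounted volume avoids both but adds a storage dependency to the deployment.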

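# src/serve.py: model loaded from an environment-specified path
import os
import pickle

from fastapi import FastAPI

MODEL_PATH = os.environ.get("MODEL_PATH", "/models/model.pkl")

def load_model():
    """Load the pickled model from a local path or an s3:// URI."""
    local_path = MODEL_PATH
    if MODEL_PATH.startswith("s3://"):
        import boto3  # imported lazily so local-only deployments don't need it
        bucket, key = MODEL_PATH[len("s3://"):].split("/", 1)
        local_path = "/tmp/model.pkl"
        boto3.client("s3").download_file(bucket, key, local_path)
    # Only unpickle artefacts you built yourself: pickle.load can execute arbitrary code
    with open(local_path, "rb") as f:
        return pickle.load(f)

app = FastAPI()
model = load_model()   # loaded once at import time, i.e. once per worker process

@app.post("/predict")
def predict(payload: dict):
    # Plain def: FastAPI runs it in a threadpool, so the blocking predict call
    # does not stall the event loop
    features = [[payload["feature_1"], payload["feature_2"]]]
    return {"score": float(model.predict_proba(features)[0, 1])}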
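Downloading at startup repeats the transfer on every container restart unless the file is cached somewhere that outlives the container, such as a mounted volume. The sketch below shows one way to do that with an atomic rename. The ensure_local_copy helper and its fetch callback are hypothetical names; fetch stands in for whatever performs the download (for instance a wrapper around s3.download_file).

```python
import os
import tempfile
from pathlib import Path
from typing import Callable

def ensure_local_copy(cache_path: str, fetch: Callable[[str], None]) -> str:
    """Return cache_path, calling fetch(tmp_path) only if the file is not cached yet.

    fetch is a hypothetical callable that writes the weights to the path it is given.
    """
    target = Path(cache_path)
    if target.exists():
        return str(target)  # cache hit: skip the download
    target.parent.mkdir(parents=True, exist_ok=True)
    # Download to a temp file in the same directory, then rename atomically so a
    # crash mid-download never leaves a truncated file at cache_path
    fd, tmp_path = tempfile.mkstemp(dir=target.parent, suffix=".part")
    os.close(fd)
    try:
        fetch(tmp_path)
        os.replace(tmp_path, target)
    finally:
        # Only present if fetch failed before the rename
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
    return str(target)
```

With several workers or replicas sharing one volume, two processes may still download concurrently; the rename keeps the result consistent, but a file lock would be needed to avoid the duplicate transfer.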

Pattern 3: GPU Container Base Image Selection

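The host's NVIDIA driver must be new enough for the CUDA runtime inside the container. Drivers are backward compatible: a newer driver runs containers built against older CUDA versions, but an older driver generally cannot run a container built for a newer CUDA major version. nvidia/cuda image tags encode the CUDA and cuDNN versions, so pin the full tag rather than a floating one.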

# requirements.txt pins must match Dockerfile base image
# nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04 requires torch built for CUDA 12.1
torch==2.2.0+cu121
torchvision==0.17.0+cu121
--extra-index-url https://download.pytorch.org/whl/cu121

# Confirm at runtime:
# python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
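The pinning rule in the comments above can be checked mechanically, for example in CI, before an image is ever built. The following is a sketch under the assumption that the image tag and the torch pin follow the naming shown above (a tag starting with major.minor.patch, and a +cuNNN local version suffix); the function names are hypothetical.

```python
import re

def cuda_from_image(image: str) -> str:
    """Extract 'major.minor' from e.g. 'nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04'."""
    tag = image.rsplit(":", 1)[-1]
    match = re.match(r"(\d+)\.(\d+)", tag)
    if match is None:
        raise ValueError(f"no CUDA version found in image tag: {tag!r}")
    return f"{match.group(1)}.{match.group(2)}"

def cuda_from_torch_pin(version: str) -> str:
    """Extract 'major.minor' from a pin such as '2.2.0+cu121' (cu121 means CUDA 12.1)."""
    match = re.search(r"\+cu(\d+)(\d)$", version)
    if match is None:
        raise ValueError(f"no +cuNNN suffix in torch version: {version!r}")
    return f"{match.group(1)}.{match.group(2)}"

def pins_match(image: str, torch_version: str) -> bool:
    """True when the base image and the torch wheel target the same CUDA major.minor."""
    return cuda_from_image(image) == cuda_from_torch_pin(torch_version)
```

A CI job could call pins_match with the FROM line's image and the torch pin from requirements.txt and fail when they drift apart.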

Pattern 4: Health and Readiness Probes

from contextlib import asynccontextmanager

from fastapi import FastAPI, Response

model = None
_ready = False   # set True after the model loads

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Runs once per worker at startup (replaces the deprecated @app.on_event hook).
    # load_model() is the loader defined in Pattern 2.
    global model, _ready
    model = load_model()
    _ready = True
    yield

app = FastAPI(lifespan=lifespan)

@app.get("/health")
async def health():
    """Liveness probe: is the process alive?"""
    return {"status": "ok"}

@app.get("/ready")
async def ready(response: Response):
    """Readiness probe: is the model loaded and ready to serve?"""
    if not _ready:
        response.status_code = 503
        return {"status": "not_ready", "detail": "model loading"}
    return {"status": "ready"}
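One caveat: ASGI servers such as uvicorn typically finish startup hooks before they accept connections, so with a blocking load in the startup hook neither probe answers until the model is in memory, and the 503 branch is rarely reached. If the liveness probe should respond while a large model loads, the load can move to a background thread. The sketch below uses only the standard library; BackgroundModel is a hypothetical helper, not a FastAPI feature.

```python
import threading
from typing import Any, Callable, Optional

class BackgroundModel:
    """Loads a model in a daemon thread and tracks readiness (hypothetical helper)."""

    def __init__(self, loader: Callable[[], Any]):
        self._loader = loader
        self._ready = threading.Event()
        self.model: Any = None
        self.error: Optional[BaseException] = None

    def start(self) -> None:
        """Kick off loading without blocking the caller."""
        threading.Thread(target=self._load, daemon=True).start()

    def _load(self) -> None:
        try:
            self.model = self._loader()
            self._ready.set()
        except Exception as exc:
            # Keep the error so the readiness probe can report why loading failed
            self.error = exc

    def is_ready(self) -> bool:
        return self._ready.is_set()

    def wait(self, timeout: Optional[float] = None) -> bool:
        """Block until the model is loaded or the timeout expires."""
        return self._ready.wait(timeout)
```

In the serving app, the startup hook would call holder.start() and return immediately, /ready would return 503 until holder.is_ready() is true, and /predict would read holder.model.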

Image Size Checklist

Use slim or distroless base images. python:3.11-slim is roughly 130MB, versus around 1GB for the full python:3.11 image.
Run apt-get update, apt-get install, and rm -rf /var/lib/apt/lists/* in a single RUN layer so the package lists never persist in an image layer.
Use multi-stage builds to exclude build tools from the final image.
Add .dockerignore to exclude notebooks, data files, .git, and __pycache__ from build context.
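Consider exporting the model to an inference-oriented format (ONNX, TorchScript) before baking it in. Export alone leaves the weights at roughly the same size; the savings come from quantizing (FP32 to INT8 cuts weight size by about 75%) or, for ONNX, from replacing the full PyTorch install with a lighter runtime such as ONNX Runtime.
Layer order matters for cache reuse: COPY requirements.txt first, RUN pip install, then COPY src/. That way code changes don't invalidate the cached pip install layer.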
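To decide what belongs in .dockerignore, it helps to see which files dominate the build context. The script below is a small sketch for that purpose; largest_files and its skip_dirs default are illustrative names and choices, not an existing tool.

```python
import os

def largest_files(root: str, top_n: int = 10, skip_dirs=(".git", "__pycache__")):
    """Return up to top_n (size_bytes, relative_path) pairs under root, largest first."""
    sizes = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune directories that are assumed to be excluded from the context already
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symlinks so a dangling link cannot raise in getsize
            sizes.append((os.path.getsize(path), os.path.relpath(path, root)))
    sizes.sort(reverse=True)
    return sizes[:top_n]
```

Running largest_files(".") from the directory that holds the Dockerfile lists the candidates; notebooks, datasets, and stray checkpoints near the top are the usual things to exclude.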

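The single most impactful container optimisation for ML: separate model weights from the image layers. A 2GB model baked into the image means 2GB crosses the network whenever that layer is rebuilt and pushed, and again on every node that pulls the image without a warm layer cache. Use volume mounts or an S3 download on startup for anything over ~100MB.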
