
Kubernetes for AI Workloads: GPU Scheduling, Model Loading, and LLM Autoscaling

Why standard K8s patterns break for LLMs. GPU resource requests. Model loading patterns (init containers, PVC, registry pull). KEDA vs HPA. PodDisruptionBudgets. Readiness vs liveness probes for model-serving pods.

Prerequisites: basic Kubernetes concepts (pods, deployments, services), Docker. After this post you will understand how to deploy LLM-backed agent services on K8s correctly: GPU scheduling, model loading patterns, custom autoscaling, disruption budgets, and probe configuration specific to AI workloads.

Kubernetes was designed for stateless web services. LLM workloads break nearly every assumption it was built on: pods have minutes-long startup times (model loading), GPUs cannot be overcommitted, CPU utilization is the wrong autoscaling signal, and draining a pod means dropping in-flight inference requests that take 10–30 seconds to complete.

Running LLMs on Kubernetes is not just adding GPU nodes. It requires rethinking scheduling, autoscaling, disruption handling, and the liveness/readiness contract.

GPU Scheduling

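GPUs are exposed to the scheduler as the extended resource nvidia.com/gpu by the NVIDIA device plugin. Unlike CPU, which can be overcommitted, GPUs are requested in whole units and allocated exclusively by default: once a pod claims a GPU, no other pod can use it unless you explicitly configure sharing (for example time-slicing or MIG).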

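apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-serving             # Required in apps/v1; must match the template labels
  template:
    metadata:
      labels:
        app: llm-serving           # Also the label the PodDisruptionBudget selects on
    spec:
      containers:
      - name: llm-server
        image: your-org/llm-server:v1.2.0
        resources:
          requests:
            memory: "24Gi"
            cpu: "4"
            nvidia.com/gpu: "1"       # One full GPU
          limits:
            memory: "28Gi"
            nvidia.com/gpu: "1"       # GPU limit must equal the request: no GPU overcommit
        # NVIDIA_VISIBLE_DEVICES is left unset on purpose: the device plugin sets it
        # for the allocated GPU, and overriding it with "all" can expose every GPU
        # on the node to this container.
      nodeSelector:
        accelerator: nvidia-a100     # Pin to GPU node pool (label name depends on your cluster)
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"         # Schedule on GPU-tainted nodes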

Model Loading Patterns

Model weights are large: at 16-bit precision (2 bytes per parameter), a 7B model is roughly 14 GB and a 70B model roughly 140 GB. How you get weights into a pod determines startup latency and scheduling flexibility.
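As a rough illustration of the init-container pattern combined with a PVC, the fragment below is a minimal sketch of a pod template for the llm-serving Deployment. The bucket name, the PVC name model-cache, the /models path, and the assumption that the pod has S3 credentials are all hypothetical placeholders, not details from a real setup.

```yaml
# Sketch only: pod template fragment, not a complete manifest.
spec:
  template:
    spec:
      initContainers:
      - name: fetch-weights
        image: amazon/aws-cli          # Entrypoint is `aws`; pin a version tag in practice
        # Hypothetical bucket and prefix. `s3 sync` only copies missing or changed files.
        args: ["s3", "sync", "s3://example-models/llm-7b/", "/models/llm-7b/"]
        volumeMounts:
        - name: model-store
          mountPath: /models
      containers:
      - name: llm-server
        image: your-org/llm-server:v1.2.0
        volumeMounts:
        - name: model-store
          mountPath: /models
          readOnly: true               # The server only reads the weights
      volumes:
      - name: model-store
        persistentVolumeClaim:
          claimName: model-cache       # Hypothetical PVC; needs an access mode that
                                       # lets every replica mount it
```

The trade-off between the patterns shows up in the volumes entry. With a shared PVC, only the first pod pays the full download and later pods start from a warm cache, but scheduling is tied to wherever that volume can be mounted. Swapping the PVC for `emptyDir: {}` removes that constraint at the cost of re-downloading on every pod start. Baking weights into the image avoids the init container entirely and moves the cost into the image pull.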

Autoscaling: Why HPA Fails and KEDA Replaces It

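The Horizontal Pod Autoscaler is most commonly configured to scale on CPU or memory utilization. For LLM workloads this is the wrong signal: an LLM pod can report 25% CPU utilization while completely saturated at the model level, because the bottleneck is token throughput on the GPU, not CPU cycles. HPA can consume custom and external metrics, but only through a metrics adapter, and it cannot scale a workload to zero by default. KEDA fills that gap: it reads event sources such as queue depth directly, manages an HPA on your behalf, and handles the transition between zero and one replica.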

# KEDA ScaledObject — scale agent worker on SQS queue depth
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: agent-worker-scaler
spec:
  scaleTargetRef:
    name: agent-worker
  minReplicaCount: 0          # Scale to zero when idle
  maxReplicaCount: 10
  cooldownPeriod: 300         # Wait 5 min after the last active trigger before scaling to zero
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/123/agent-tasks
      queueLength: "5"        # 1 replica per 5 queued messages
      awsRegion: us-east-1
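The same idea can be applied to the model-serving pods themselves by scaling on a model-level signal instead of CPU. The following is a minimal sketch using KEDA's Prometheus trigger. The metric name llm_requests_waiting and the Prometheus address are hypothetical; substitute whatever queue-depth or in-flight-request metric your serving framework actually exports.

```yaml
# Sketch only: scale llm-serving on queued inference requests.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-serving-scaler
spec:
  scaleTargetRef:
    name: llm-serving
  minReplicaCount: 2          # Keep warm replicas: a cold start includes model loading
  maxReplicaCount: 8
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090    # Hypothetical address
      query: sum(llm_requests_waiting{app="llm-serving"})     # Hypothetical metric
      threshold: "4"          # Target of about 4 waiting requests per replica
```

Unlike the queue worker, this sketch does not scale to zero. For GPU pods with minutes-long startup, the first request after an idle period would otherwise wait for scheduling plus a full model load, so whether scale-to-zero is acceptable depends on your latency budget.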

PodDisruptionBudgets for LLM Serving

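When a K8s node drains (maintenance, cluster upgrade), pods are evicted. Without protection, all replicas of your LLM deployment can be evicted simultaneously and your service goes dark. A PodDisruptionBudget caps how many replicas a voluntary eviction may take down at once. It does not cover involuntary disruptions such as hardware failure, and it only helps with spot preemption if something drains the node through the eviction API before the instance is reclaimed.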

# PodDisruptionBudget
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: llm-serving-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: llm-serving
---
# Graceful shutdown in deployment spec
spec:
  template:
    spec:
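      terminationGracePeriodSeconds: 120  # Total budget: preStop sleep plus in-flight requests
      containers:
      - name: llm-server
        lifecycle:
          preStop:
            exec:
              # Keep serving while the pod is removed from Service endpoints;
              # SIGTERM is sent only after this hook returns. Requires /bin/sh in the image.
              command: ["/bin/sh", "-c", "sleep 30"]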

Readiness vs Liveness Probes for LLM Pods

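The liveness/readiness distinction matters more for LLM pods because model loading takes minutes, not milliseconds. Readiness controls whether the pod receives traffic; liveness controls whether the kubelet restarts the container. A liveness probe that starts failing while the model is still loading restarts the pod before it ever becomes ready, and the pod can loop that way indefinitely.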
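One way to express this contract is with three probes: a startup probe that covers model loading, a readiness probe that gates traffic, and a liveness probe that only takes over once startup has succeeded. The fragment below is a sketch under assumptions: the port 8000 and the /healthz and /ready paths are hypothetical and depend on what your model server exposes, and the thresholds should be sized to your measured load time.

```yaml
# Sketch only: container fragment for the llm-serving pod template.
      containers:
      - name: llm-server
        startupProbe:               # Liveness and readiness are held off until this succeeds
          httpGet:
            path: /ready            # Hypothetical: returns 200 once the model is loaded
            port: 8000
          periodSeconds: 10
          failureThreshold: 60      # Allows up to 10 min (60 x 10 s) for model loading
        readinessProbe:             # Failing removes the pod from Service endpoints
          httpGet:
            path: /ready
            port: 8000
          periodSeconds: 5
          failureThreshold: 2
        livenessProbe:              # Failing restarts the container
          httpGet:
            path: /healthz          # Hypothetical: cheap check that the process responds
            port: 8000
          periodSeconds: 15
          failureThreshold: 4
```

Keeping the liveness check cheap and separate from the readiness check is a deliberate choice in this sketch: a pod that is busy with long inference requests may temporarily fail readiness and shed traffic, but it should not be restarted, since a restart means paying the full model load again.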

The framing that helps: your K8s manifests are load-bearing, not boilerplate. A missing PDB means a routine node upgrade takes down your service. Wrong probe delays mean traffic hits pods before the model is loaded. KEDA misconfiguration means you pay for idle GPU capacity all night. Every line in your deployment spec is a decision about availability, cost, and correctness.
