Kubernetes for AI Workloads: GPU Scheduling, Model Loading, and LLM Autoscaling
Why standard K8s patterns break for LLMs. GPU resource requests. Model loading patterns (init containers, PVC, registry pull). KEDA vs HPA. PodDisruptionBudgets. Readiness vs liveness probes for model-serving pods.
Prerequisites: basic Kubernetes concepts (pods, deployments, services), Docker. After this post you will understand how to deploy LLM-backed agent services on K8s correctly: GPU scheduling, model loading patterns, custom autoscaling, disruption budgets, and probe configuration specific to AI workloads.
Kubernetes was designed for stateless web services. LLM workloads break nearly every assumption it was built on: pods have minutes-long startup times (model loading), GPUs cannot be overcommitted, CPU utilization is the wrong autoscaling signal, and draining a pod means dropping in-flight inference requests that take 10–30 seconds to complete.
Running LLMs on Kubernetes is not just adding GPU nodes. It requires rethinking scheduling, autoscaling, disruption handling, and the liveness/readiness contract.
GPU Scheduling
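GPUs are exposed to K8s as an extended resource (nvidia.com/gpu) by the NVIDIA device plugin. Unlike CPU, which can be overcommitted, a GPU request is exclusive by default: once a pod claims a GPU, no other pod can use it unless you explicitly configure sharing (for example MIG or time-slicing).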
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-serving
  template:
    metadata:
      labels:
        app: llm-serving            # Must match the selector above
    spec:
      containers:
        - name: llm-server
          image: your-org/llm-server:v1.2.0
          resources:
            requests:
              memory: "24Gi"
              cpu: "4"
              nvidia.com/gpu: "1"   # One full GPU
            limits:
              memory: "28Gi"
              nvidia.com/gpu: "1"   # GPU limit must equal the request; no overcommit
          # NVIDIA_VISIBLE_DEVICES is not set here: the device plugin sets it for the
          # allocated GPU, and overriding it with "all" can expose every GPU on the node.
      nodeSelector:
        accelerator: nvidia-a100    # Pin to GPU node pool
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"      # Schedule on GPU-tainted nodes
Model Loading Patterns
Model weights are large: at 16-bit precision (2 bytes per parameter), a 7B model is ~14GB and a 70B model is ~140GB. How you get weights into a pod determines startup latency and scheduling flexibility.
- Init container: a separate container downloads weights from S3/GCS before the serving container starts. Clean isolation, but every newly created pod re-downloads the weights unless they land on a volume that outlives the pod.
- PersistentVolumeClaim with pre-loaded weights: mount a PVC containing the model. Fast startup, but a ReadWriteOnce volume can attach to only one node at a time, which reduces scheduling flexibility unless the storage supports ReadOnlyMany or ReadWriteMany access.
- Model registry pull on startup: the serving container downloads from Hugging Face Hub or MLflow at startup. Most flexible for multi-model deployments, slowest to start.
- Node-local cache via DaemonSet: pre-warm new GPU nodes by pulling weights to node disk. Best startup time, but requires node lifecycle management.
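As a rough illustration of the init-container pattern, the fragment below sketches a pod template in which an init container copies weights into an emptyDir volume shared with the serving container. The bucket path s3://example-models/llm-7b, the /models mount path, the 20Gi size limit, and the choice of the amazon/aws-cli image are assumptions made for illustration; credentials (for example an IAM role bound to the service account) are left out.

```yaml
# Sketch only: fragment of a Deployment's pod template.
spec:
  template:
    spec:
      initContainers:
        - name: fetch-weights
          image: amazon/aws-cli:latest    # assumption: any image with an S3/GCS client works
          # The image's entrypoint is `aws`, so args carry only the subcommand.
          args: ["s3", "sync", "s3://example-models/llm-7b", "/models"]  # hypothetical bucket
          volumeMounts:
            - name: model-store
              mountPath: /models
      containers:
        - name: llm-server
          image: your-org/llm-server:v1.2.0
          volumeMounts:
            - name: model-store
              mountPath: /models          # serving container reads what the init container wrote
              readOnly: true
      volumes:
        - name: model-store
          emptyDir:
            sizeLimit: 20Gi               # assumption: sized for a ~14GB model
```

Because an emptyDir lives and dies with the pod, this version pays the download cost on every new pod. Swapping the volume for a PVC or a node-local path would trade that cost for the scheduling constraints described above.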
Autoscaling: Why HPA Fails and KEDA Replaces It
By default, the Horizontal Pod Autoscaler (HPA) scales on CPU or memory utilization. For LLM workloads this is the wrong signal. An LLM pod can report 25% CPU utilization while completely saturated at the model level, because the bottleneck is token throughput, not CPU cycles.
- HPA failure mode: the model is at 100% of its throughput capacity and p95 latency is 8x normal, but CPU sits at 25%. A CPU-based HPA does nothing, so scaling happens too late or not at all.
- KEDA (Kubernetes Event-driven Autoscaling): scales on external metrics such as SQS queue depth, Redis list length, or Prometheus queries, and feeds them to an HPA that it manages for you. Right signal for agent workers: task queue depth. Right signal for serving pods: inference request queue or token throughput.
- Scale-to-zero: KEDA supports 0 replicas when the queue is empty. Critical cost lever for off-peak agent workloads. cooldownPeriod sets how long KEDA waits after the last active trigger before scaling to zero, which prevents thrashing during bursty traffic.
# KEDA ScaledObject: scale agent workers on SQS queue depth
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: agent-worker-scaler
spec:
  scaleTargetRef:
    name: agent-worker
  minReplicaCount: 0        # Scale to zero when idle
  maxReplicaCount: 10
  cooldownPeriod: 300       # Wait 5 min after the last active trigger before scaling to zero
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123/agent-tasks
        queueLength: "5"    # Target of 5 queued messages per replica
        awsRegion: us-east-1
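For serving pods, the equivalent signal is the inference request queue. A minimal sketch using KEDA's Prometheus trigger might look like the following. The Prometheus address, the inference_queue_depth metric, and the replica bounds are hypothetical and would need to match whatever your serving stack actually exports.

```yaml
# Sketch only: scale serving pods on queued inference requests.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-serving-scaler
spec:
  scaleTargetRef:
    name: llm-serving
  minReplicaCount: 2          # keep warm replicas, since cold starts take minutes
  maxReplicaCount: 8          # assumption: bounded by the GPU node pool
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300   # damp scale-down between replica counts above the minimum
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090          # hypothetical address
        query: 'sum(inference_queue_depth{app="llm-serving"})'        # hypothetical metric
        threshold: "4"        # target of roughly 4 queued requests per replica
```

The sketch keeps minReplicaCount above zero on purpose: scale-to-zero suits queue workers, but a GPU serving pod that has to reload a model on the first request would add minutes of latency.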
PodDisruptionBudgets for LLM Serving
When a K8s node drains (maintenance, spot preemption, cluster upgrade), pods are evicted. Without protection, all replicas of your LLM deployment can be evicted simultaneously — your service goes dark.
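- PodDisruptionBudget (PDB): defines the minimum number of pods that must remain available during voluntary disruptions such as drains and upgrades. Set minAvailable: 1 for any deployment with 2+ replicas. A PDB does not protect against involuntary disruptions such as a node failure, or a spot node that is reclaimed before its drain completes.
- Graceful termination: set terminationGracePeriodSeconds to at least 2x your p99 inference latency, plus the duration of any preStop hook, because the hook counts against the same budget. The default of 30 seconds is too short for requests that can run 10–30 seconds.
- PreStop hook: runs before the container receives SIGTERM. A short sleep here gives load balancers and endpoints time to stop routing new requests to the pod; after SIGTERM, the server still has to finish its in-flight requests before the process exits.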
# PodDisruptionBudget
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: llm-serving-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: llm-serving
---
# Graceful shutdown (fragment of the Deployment spec)
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 120   # 2 min total: preStop sleep plus in-flight requests
      containers:
        - name: llm-server
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 30"]   # Runs before SIGTERM; lets endpoint removal propagate
Readiness vs Liveness Probes for LLM Pods
The liveness/readiness distinction matters more for LLM pods because model loading takes minutes, not milliseconds.
- Readiness probe: checks whether the pod is ready to serve inference. For LLM pods: HTTP GET /ready verifies that model weights are loaded and the tokenizer responds. Set initialDelaySeconds to cover model load time (2–15 minutes), or let a startup probe cover that window instead. Until readiness passes, the pod is excluded from the load balancer, not killed.
- Liveness probe: checks whether the pod is stuck and needs a restart. Simple GET /health. Set failureThreshold to 3–5 (3 is the default) with a generous timeoutSeconds, so a pod processing a legitimate long request is not restarted.
- Startup probe: for very slow-starting pods, use startupProbe with a high failureThreshold. Liveness and readiness checks are held off until the startup probe succeeds, which prevents liveness from killing a pod that is still loading the model.
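Put together, the three probes could be configured roughly as follows. This is a sketch under assumptions: port 8000 and all timing values are illustrative, and it presumes the server exposes the /ready and /health endpoints described above.

```yaml
# Sketch only: fragment of the llm-server container spec.
containers:
  - name: llm-server
    ports:
      - containerPort: 8000        # assumption
    startupProbe:
      httpGet:
        path: /ready
        port: 8000
      periodSeconds: 10
      failureThreshold: 90         # up to 90 x 10s = 15 min to load the model
    readinessProbe:
      httpGet:
        path: /ready
        port: 8000
      periodSeconds: 5
      failureThreshold: 2          # drop out of the load balancer quickly
    livenessProbe:
      httpGet:
        path: /health
        port: 8000
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 5          # restart only after ~50s of consecutive failures
```

With a startup probe sized for the worst-case load time, the readiness and liveness probes can use short periods without risking a restart during model loading.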
The framing that helps: your K8s manifests are load-bearing, not boilerplate. A missing PDB means a routine node upgrade takes down your service. Wrong probe delays mean traffic hits pods before the model is loaded. KEDA misconfiguration means you pay for idle GPU capacity all night. Every line in your deployment spec is a decision about availability, cost, and correctness.