
Cold Start Latency: Why Your Serverless LLM Endpoint Is Spiking

Container cold starts, model loading, connection pool exhaustion — the latency cliff that hits low-traffic LLM endpoints. Diagnosis, warm-up strategies, and when to give up on serverless.

The demo went flawlessly on Tuesday morning. Wednesday night, a product review with an investor: the first response took 28 seconds. The second took 3 seconds. The third was fast. The investor asked if the product was always this slow.

It wasn't always slow. It was only slow on the first request after a period of inactivity: the classic cold start. In a serverless LLM deployment, cold starts are often 10-30× longer than warm requests. A stress test keeps the endpoint warm for its whole run, so the latency profile it shows you is not the one your users experience after an idle period.

What happens during a cold start

A cold start has several sequential phases, each adding latency:

Container provisioning: the cloud provider spins up a new container instance. Typically 1-3 seconds for standard container sizes.
Model loading: the model weights are loaded from storage into GPU/CPU memory. For large models (70B+), this can take 20-60 seconds. For smaller 7B models, 2-5 seconds.
Framework initialization: PyTorch, transformers, and your application code initialize. Add 1-5 seconds.
Connection pool warming: database connections, vector store clients, and HTTP connection pools establish their initial connections. Another 1-3 seconds.
First-request JIT compilation: some inference frameworks (torch.compile, TensorRT) perform just-in-time compilation on the first request shape they see. This can add 5-30 seconds on the first request only.
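
To see which of these phases dominates in your own deployment, it helps to time each one at startup. The following is a minimal sketch, not a definitive implementation: StartupTimer and the phase names are hypothetical, and container provisioning cannot be measured this way because it happens before your code runs.

```python
import time
from contextlib import contextmanager


class StartupTimer:
    """Records how long each named startup phase takes, in call order."""

    def __init__(self, clock=time.perf_counter):
        # The clock is injectable so the timer can be tested deterministically.
        self._clock = clock
        self.phases = []  # list of (name, seconds)

    @contextmanager
    def phase(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the duration even if the phase raised.
            self.phases.append((name, self._clock() - start))

    def total(self):
        return sum(seconds for _, seconds in self.phases)

    def slowest(self):
        """Return the (name, seconds) pair of the longest phase."""
        return max(self.phases, key=lambda item: item[1])

    def report(self):
        lines = [f"{name}: {seconds:.2f}s" for name, seconds in self.phases]
        lines.append(f"total: {self.total():.2f}s")
        return "\n".join(lines)
```

Wrap each startup step in its own block, for example `with timer.phase("model_load"): ...`, and log `timer.report()` once the container is ready. The phase that `slowest()` returns is the one worth optimizing first.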

Diagnosis: is your latency spike a cold start?

Cold start latency has a distinctive signature. It appears only on the first request after a period of inactivity (typically 5-30 minutes, depending on your provider's scale-to-zero policy). It doesn't affect throughput once the container is warm. And it shows up in TTFT (time to first token) rather than in tokens-per-second. If your p50 TTFT is 1.2s and your p99 is 24s, and the 24s requests cluster after low-traffic periods, you have a cold start problem.
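
One way to check for that clustering is to split your request log by the idle gap before each request and compare TTFT between the two groups. The sketch below assumes a log of (arrival time, TTFT) pairs in seconds. The function names, the 300-second idle threshold, and the 5× ratio are illustrative assumptions to tune against your provider's actual idle timeout.

```python
from statistics import median


def split_by_idle_gap(requests, idle_threshold_s=300.0):
    """Split TTFT samples into (after_idle, warm) lists.

    requests: iterable of (arrival_time_s, ttft_s) tuples.
    A request counts as 'after idle' when the gap since the previous
    request is at least idle_threshold_s. The first request in the log
    has no predecessor and is treated as 'after idle' (an assumption).
    """
    after_idle, warm = [], []
    previous_arrival = None
    for arrival, ttft in sorted(requests):
        if previous_arrival is None or arrival - previous_arrival >= idle_threshold_s:
            after_idle.append(ttft)
        else:
            warm.append(ttft)
        previous_arrival = arrival
    return after_idle, warm


def looks_like_cold_start(requests, idle_threshold_s=300.0, ratio=5.0):
    """True when median TTFT after idle is at least `ratio` times the warm median."""
    after_idle, warm = split_by_idle_gap(requests, idle_threshold_s)
    if not after_idle or not warm:
        return False  # not enough data in one of the groups to compare
    return median(after_idle) >= ratio * median(warm)
```

If the check comes back false while p99 is still high, the spike is more likely load-related (queueing, long prompts) than a cold start.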

Mitigation strategies

1. Scheduled warm-up pings

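The simplest fix: send a lightweight synthetic request to your endpoint on a schedule. Most serverless providers keep containers alive for 5-15 minutes after the last request, so pick an interval shorter than your provider's idle timeout; pinging every 5 minutes against a 5-minute timeout can still lose the race. This costs a small amount in warm-ping compute and keeps one instance warm. It does not help when a traffic burst forces the provider to start additional instances.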
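
A warm-up pinger can be as small as a loop around an HTTP GET. This is a sketch using only the standard library. The 240-second default interval is an assumption chosen to sit under a 5-minute idle timeout, and the function names are hypothetical. In production you would more likely run this from a scheduler (cron, a cloud scheduler job) than from a long-lived process.

```python
import time
import urllib.request


def send_ping(url, timeout_s=30.0):
    """Send one GET request and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return response.status


def keep_warm(url, interval_s=240.0, max_pings=None, ping=send_ping, sleep=time.sleep):
    """Ping `url` every `interval_s` seconds and return the number of failed pings.

    max_pings=None runs forever. A failed ping is logged rather than raised,
    so one bad request does not stop the loop. `ping` and `sleep` are
    injectable to make the loop testable without network access.
    """
    sent = 0
    failures = 0
    while max_pings is None or sent < max_pings:
        try:
            ping(url)
        except Exception as exc:
            failures += 1
            print(f"warm-up ping failed: {exc}")
        sent += 1
        # Skip the final sleep when a ping limit has been reached.
        if max_pings is None or sent < max_pings:
            sleep(interval_s)
    return failures
```

The generous timeout matters: if the ping itself lands on a cold container, it has to wait out the cold start instead of failing early.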

2. Minimum instance count

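Set a minimum instance count of 1 on your serverless function (written here as min_instances=1; the exact setting name varies by provider). One container stays alive and warm, so requests that fit on that instance never see a cold start. You pay for one instance of compute continuously, so evaluate whether that's cheaper than the cold start impact on user experience.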
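
A rough way to frame that evaluation is to work out what each avoided cold start costs you. The sketch below is simple arithmetic under stated assumptions: the function name is hypothetical, the inputs are numbers you would take from your own bill and logs, and it only counts the idle hours you would not otherwise pay for.

```python
def always_on_cost(hourly_rate_usd, idle_hours_per_day, cold_starts_per_day, days=30):
    """Return (extra_monthly_cost_usd, cost_per_avoided_cold_start_usd).

    hourly_rate_usd: price of one warm instance per hour.
    idle_hours_per_day: hours per day the instance would otherwise be scaled to zero.
    cold_starts_per_day: cold starts observed today without a minimum instance.
    """
    if cold_starts_per_day <= 0:
        raise ValueError("no cold starts to avoid; a minimum instance buys nothing")
    extra_monthly_cost = hourly_rate_usd * idle_hours_per_day * days
    avoided_cold_starts = cold_starts_per_day * days
    return extra_monthly_cost, extra_monthly_cost / avoided_cold_starts
```

With an illustrative rate of $1.50 per hour, 20 idle hours a day, and 10 cold starts a day, the extra spend is $900 a month, or $3 per avoided cold start. Whether that is worth it depends on who is waiting on the other end of those requests.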

3. Lazy initialization → eager initialization

Move model loading and connection pool initialization from first-request time to container startup time. The container start takes longer, but once up, the first request is warm. This is the correct architecture for any LLM deployment — model loading should never happen on the request path.
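
The shape of eager initialization, sketched without any particular serving framework: load the model in the constructor, optionally run one warm-up inference to absorb first-request JIT compilation, and only then report ready. EagerModelServer, load_model, and warmup_prompt are hypothetical names. In a real service the constructor would run in your framework's startup hook and health() would back the platform's readiness probe.

```python
class EagerModelServer:
    """Loads the model during startup so the request path only runs inference."""

    def __init__(self, load_model, warmup_prompt=None):
        self.ready = False
        # Expensive step: happens once, at container startup, not per request.
        self.model = load_model()
        if warmup_prompt is not None:
            # One throwaway inference so any first-call compilation
            # happens before real traffic arrives.
            self.model(warmup_prompt)
        self.ready = True

    def health(self):
        """Readiness check: report ready only after loading has finished."""
        return {"ready": self.ready}

    def handle(self, prompt):
        # No loading logic here: the model is already in memory.
        return self.model(prompt)
```

If load_model raises, the constructor fails and the container never reports ready. That is the behavior you want, because a broken instance then receives no traffic.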

4. User-facing mitigation

If cold starts are unavoidable (low-traffic app, cost constraints), be honest with UX: show a 'waking up...' state for first requests instead of a silent spinner. Users tolerate known waits better than unexplained delays.

The most important architectural rule: model weights should be loaded into memory before the first request arrives, not during it. Any framework that loads models on the request path makes a user wait out the full load on the first request to every new instance, which means on every scale-up event.
