From Fine-Tuned Model to Production: Serving, Versioning, and Monitoring
The gap between a fine-tuned checkpoint and a production deployment. Adapter merging, quantisation for serving, model versioning strategy, and the monitoring signals that catch regressions.
Training a fine-tuned model is the easy part. Getting it into production reliably — with acceptable latency, cost, version control, and monitoring — is where most teams discover the real work. A fine-tuned checkpoint sitting on a training server is not a production model. It's a starting point.
This post covers the gap: what needs to happen between 'training complete' and 'serving production traffic' for a fine-tuned LLM.
Step 1: Merge and quantise for serving
If you trained with LoRA, merge the adapter into the base model weights before deployment. The merged model loads without the PEFT library, adds no adapter overhead at inference time, and is easier to serve with standard frameworks.
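To see why a merged model carries no adapter overhead, here is a minimal pure-Python sketch of the arithmetic behind a LoRA merge, assuming the standard formulation W' = W + (alpha / r) · B · A. The helper names (`matmul`, `merge_lora`) and the toy matrices are illustrative assumptions, not part of any library.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x n)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(w: Matrix, lora_a: Matrix, lora_b: Matrix, alpha: float, r: int) -> Matrix:
    """Fold the low-rank update into the base weight: W' = W + (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)  # (d_out x r) @ (r x d_in) -> (d_out x d_in)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


# Toy example (assumed values): d_out = 2, d_in = 2, rank r = 1, alpha = 2.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]    # r x d_in
B = [[0.5], [1.0]]  # d_out x r
x = [[1.0], [1.0]]  # input column vector

merged = merge_lora(W, A, B, alpha=2.0, r=1)

# Unmerged path: base output plus the scaled adapter output (extra matmuls per layer).
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
unmerged_out = [[base_out[i][0] + (2.0 / 1) * adapter_out[i][0]] for i in range(2)]

# Merged path: a single matmul against the folded weight.
merged_out = matmul(merged, x)
```

Both paths produce the same output, but the merged path needs only the base-shaped weight. In practice you would not do this by hand: PEFT provides `merge_and_unload()` on a loaded `PeftModel`, after which the result can be saved like any other model.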
After merging, quantise for serving. Training in fp16 or bf16 is standard, but serving in fp16 needs roughly twice the weight memory of INT8, and INT8 quantisation has minimal quality impact for most tasks. For cost-sensitive, high-throughput deployments, GPTQ or AWQ INT4 quantisation reduces memory further.
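Approximate weight memory and quality trade-offs by format:
- fp16: ~2 bytes per parameter (the unquantised baseline). Best quality. Use when latency and memory are not constrained.
- INT8: ~1 byte per parameter, about half the fp16 size. ~1-2% quality degradation. Good default for production.
- GPTQ INT4: ~0.5 bytes per parameter, about a quarter of the fp16 size. ~2-5% degradation. Use for cost-sensitive, high-volume deployments.
- AWQ INT4: ~0.5 bytes per parameter, about a quarter of the fp16 size. Slightly better quality than GPTQ. Emerging standard.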
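As a rough illustration of the sizes involved, the sketch below estimates weight memory from a parameter count and a bytes-per-parameter figure. The 7-billion-parameter model is a hypothetical example, and the estimate covers weights only: it ignores the KV cache, activations, and the small overhead of quantisation scales and zero points.

```python
# Approximate storage cost of one parameter in each format.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


# Hypothetical 7-billion-parameter model.
estimates = {p: weight_memory_gb(7e9, p) for p in BYTES_PER_PARAM}

for precision, gb in estimates.items():
    print(f"{precision}: ~{gb:.1f} GB of weights")
```

For this assumed model size, the estimate comes to about 14 GB in fp16, 7 GB in INT8, and 3.5 GB in INT4, which is often the difference between needing one GPU or two.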
Step 2: Model versioning strategy
Fine-tuned models need versioning like software. Every production checkpoint should be tagged with: base model version, training data hash, key hyperparameters, eval results, and deployment date. Use a model registry (MLflow, Weights & Biases, Hugging Face Hub private repos) to store checkpoints and their metadata.
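One way to capture that metadata is a small immutable record serialised to JSON and stored next to the checkpoint. This is a sketch only: the `ModelVersion` class, its field names, and every value below are hypothetical placeholders, and a real registry (MLflow, Weights & Biases, Hugging Face Hub) has its own metadata mechanisms.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Dict


def sha256_of_bytes(data: bytes) -> str:
    """Hex digest identifying the exact training data that was used."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)  # frozen: a published version record should not be mutated
class ModelVersion:
    version: str
    base_model: str
    training_data_sha256: str
    hyperparameters: Dict[str, float]
    eval_results: Dict[str, float]
    deployed_on: str  # ISO 8601 date


# All values are placeholders for illustration.
record = ModelVersion(
    version="support-bot-v3",
    base_model="example-org/base-7b",
    training_data_sha256=sha256_of_bytes(b'{"prompt": "hi", "response": "hello"}\n'),
    hyperparameters={"lora_r": 16, "lora_alpha": 32, "learning_rate": 2e-4},
    eval_results={"exact_match": 0.81},
    deployed_on="2024-01-15",
)

metadata_json = json.dumps(asdict(record), indent=2, sort_keys=True)
print(metadata_json)
```

For a real dataset you would hash the file in chunks rather than loading it into memory, but the idea is the same: the hash lets you tell later whether two checkpoints were trained on identical data.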
Never overwrite a production model checkpoint. Always keep the previous version available before deploying a new one. When a fine-tuned model regresses in production, you need to roll back in minutes, not spend hours retraining.
Step 3: Serving infrastructure
Standard serving stacks for fine-tuned LLMs:
- vLLM: highest throughput for text generation, continuous batching, PagedAttention. Best for high-traffic endpoints.
- TGI (Hugging Face Text Generation Inference): strong production hardening, tensor parallelism, good OpenAI-compatible API.
- Ollama: simplest for local and small-scale deployment. Good for internal tools and development.
- Cloud managed inference: Modal, Replicate, Together AI — reduces infrastructure management at the cost of less control.
Step 4: Latency and throughput benchmarking
Before routing production traffic, benchmark your fine-tuned model against your baseline on latency and throughput. Measure TTFT (time to first token), TPS (tokens per second), and p95/p99 latency under realistic load. A fine-tuned model with the same architecture as the base model, served with the adapter merged and the same config, should show the same per-token latency. Confirm this, especially if you changed quantisation or model size. End-to-end latency can still differ if the fine-tuned model produces longer or shorter responses than the baseline.
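The metrics above can be computed from timestamps you record per request. The sketch below assumes you have, for each request, a start time and the arrival time of each streamed token (in seconds); the function names and sample numbers are illustrative, and the percentile uses the simple nearest-rank definition.

```python
import math
from typing import Sequence


def ttft(request_start: float, token_times: Sequence[float]) -> float:
    """Time to first token: first token arrival minus request start (seconds)."""
    return token_times[0] - request_start


def tokens_per_second(token_times: Sequence[float]) -> float:
    """Decode throughput for one request, measured from the first to the last token."""
    if len(token_times) < 2:
        return 0.0
    return (len(token_times) - 1) / (token_times[-1] - token_times[0])


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# One hypothetical request: sent at t=0.0, tokens arrive every 0.25 s.
token_times = [0.25, 0.5, 0.75, 1.0]
first_token_latency = ttft(0.0, token_times)
decode_tps = tokens_per_second(token_times)

# Hypothetical end-to-end latencies in milliseconds from a load test.
latencies_ms = list(range(1, 101))
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
```

Run the same measurement against the baseline and the fine-tuned model under the same load profile, and compare the percentiles rather than the means, since tail latency is what users notice.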
Step 5: Production monitoring
- Quality signals: log a sample of (prompt, response) pairs. Run your eval pipeline on this sample asynchronously. Alert on quality metric drop >5% from baseline.
- Latency: monitor TTFT and e2e latency percentiles. Fine-tuned models can regress on latency if serving config isn't identical.
- Format compliance: for structured output tasks, monitor parse error rate. A sudden spike indicates a model regression.
- Distribution shift: monitor the distribution of input types. If users discover an unexpected capability (or failure mode), the input distribution shifts, and your model may be serving out-of-distribution queries.
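Three of the signals above reduce to small checks that can run on a logged sample. The sketch below makes several assumptions: the 5% quality threshold is read as a relative drop, structured outputs are JSON, and each logged prompt already carries an input-type label (from a classifier or routing tag you would supply). Total variation distance is one simple choice of shift measure, not the only one.

```python
import json
from collections import Counter
from typing import Iterable, Sequence


def quality_drop_alert(baseline: float, current: float, max_relative_drop: float = 0.05) -> bool:
    """True when the metric fell by more than max_relative_drop relative to baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - current) / baseline > max_relative_drop


def parse_error_rate(responses: Sequence[str]) -> float:
    """Fraction of responses that are not valid JSON."""
    if not responses:
        return 0.0
    failures = 0
    for text in responses:
        try:
            json.loads(text)
        except json.JSONDecodeError:
            failures += 1
    return failures / len(responses)


def total_variation_distance(reference: Iterable[str], live: Iterable[str]) -> float:
    """Half the L1 distance between two label distributions (0 = identical, 1 = disjoint)."""
    ref_counts, live_counts = Counter(reference), Counter(live)
    ref_total, live_total = sum(ref_counts.values()), sum(live_counts.values())
    if ref_total == 0 or live_total == 0:
        raise ValueError("both samples must be non-empty")
    labels = set(ref_counts) | set(live_counts)
    return 0.5 * sum(
        abs(ref_counts[label] / ref_total - live_counts[label] / live_total)
        for label in labels
    )


# Hypothetical readings from a sampled batch of production traffic.
alert = quality_drop_alert(baseline=0.80, current=0.75)
error_rate = parse_error_rate(['{"a": 1}', "not json", "[]", "{bad"])
shift = total_variation_distance(["faq", "faq", "refund", "refund"], ["faq"] * 4)
```

Which threshold counts as a spike in parse errors or a meaningful shift depends on your traffic, so it is worth recording these values for a week or two before wiring them to alerts.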
Step 6: Rollback plan
Define your rollback trigger before deployment: what metric, at what threshold, triggers an automatic or manual rollback to the previous model version. Have the previous checkpoint ready to serve with a config change, not a redeployment. Test the rollback procedure in staging before going live.
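A rollback that is "a config change, not a redeployment" usually means the serving layer reads an active-version pointer that can be moved back. The in-memory sketch below illustrates that idea together with a predefined trigger. `ModelRegistry`, `should_roll_back`, the paths, and the 2% threshold are all hypothetical; a real setup would keep the pointer in your registry or config store.

```python
from typing import Dict, List, Optional


class ModelRegistry:
    """In-memory stand-in for a registry with immutable versions and an active pointer."""

    def __init__(self) -> None:
        self._checkpoints: Dict[str, str] = {}  # version -> checkpoint path
        self._history: List[str] = []           # versions in promotion order

    def register(self, version: str, checkpoint_path: str) -> None:
        # Refuse to overwrite: an existing version must stay servable as-is.
        if version in self._checkpoints:
            raise ValueError(f"version {version!r} already exists")
        self._checkpoints[version] = checkpoint_path

    def promote(self, version: str) -> None:
        if version not in self._checkpoints:
            raise KeyError(version)
        self._history.append(version)

    @property
    def active(self) -> Optional[str]:
        return self._history[-1] if self._history else None

    def rollback(self) -> str:
        """Move the active pointer back to the previously promoted version."""
        if len(self._history) < 2:
            raise RuntimeError("no previous version to roll back to")
        self._history.pop()
        return self._history[-1]


def should_roll_back(parse_error_rate: float, threshold: float = 0.02) -> bool:
    """Trigger agreed before deployment: one metric, one threshold."""
    return parse_error_rate > threshold


registry = ModelRegistry()
registry.register("v1", "/models/v1")  # hypothetical paths
registry.register("v2", "/models/v2")
registry.promote("v1")
registry.promote("v2")

observed_parse_error_rate = 0.10  # hypothetical monitoring reading
restored = registry.rollback() if should_roll_back(observed_parse_error_rate) else None
```

Because the old checkpoint was never overwritten, the rollback is just a pointer move. Exercising this path in staging is what tells you whether the serving layer actually picks up the change without a restart.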