GenAI Systems Lab
AI Engineering

From Fine-Tuned Model to Production: Serving, Versioning, and Monitoring

How to close the gap between a fine-tuned checkpoint and a production deployment: adapter merging, quantisation for serving, model versioning strategy, and the monitoring signals that catch regressions.

Training a fine-tuned model is the easy part. Getting it into production reliably — with acceptable latency, cost, version control, and monitoring — is where most teams discover the real work. A fine-tuned checkpoint sitting on a training server is not a production model. It's a starting point.

This post covers the gap: what needs to happen between 'training complete' and 'serving production traffic' for a fine-tuned LLM.

Step 1: Merge and quantise for serving

If you trained with LoRA, merge the adapter into the base model weights before deployment. The merged model loads without the PEFT library, adds no adapter overhead at inference time (the low-rank update is folded into the existing weight matrices), and is easier to serve with standard frameworks.
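To make the merge concrete, here is a minimal sketch of the arithmetic in plain Python on toy matrices. It assumes the usual LoRA formulation, where the adapter adds (alpha / r) * B·A on top of the frozen weight W. The function names and all values are illustrative and do not come from any library; in a real workflow the PEFT library's `merge_and_unload()` performs the merge across the model's layers.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_a, lora_b, alpha, r):
    """Fold a LoRA adapter into the base weight: W + (alpha / r) * (B @ A)."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)  # shape (d_out, d_in), same as W
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def forward_unmerged(w, lora_a, lora_b, alpha, r, x):
    """Base path plus adapter path, as computed while the adapter is separate."""
    scale = alpha / r
    base = matmul(w, x)
    adapter = matmul(lora_b, matmul(lora_a, x))
    return [[b[0] + scale * a[0]] for b, a in zip(base, adapter)]


# Toy example (all values are made up): d_in = d_out = 2, rank r = 1.
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, -1.0]]   # r x d_in
B = [[2.0], [1.0]]  # d_out x r
x = [[1.0], [1.0]]  # one input column vector

W_merged = merge_lora(W, A, B, alpha=2, r=1)
y_merged = matmul(W_merged, x)                   # a single matmul
y_unmerged = forward_unmerged(W, A, B, 2, 1, x)  # three matmuls
```

Both paths produce the same output, but the merged path needs only the one matrix multiply the base model already performs, which is why the merged model has the same inference cost as the base model.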

After merging, quantise for serving. Training in fp16 or bf16 is standard, but serving at that precision takes twice the weight memory of INT8, and INT8 quantisation has minimal quality impact for most tasks. For cost-sensitive, high-throughput deployments, GPTQ or AWQ INT4 quantisation roughly halves weight memory again.

fp16: ~2 GB of weights per billion parameters. Best quality. Use when latency and memory are not constrained.
INT8: ~1 GB of weights per billion parameters. ~1-2% quality degradation. Good default for production.
GPTQ INT4: ~0.5 GB of weights per billion parameters. ~2-5% degradation. Use for cost-sensitive, high-volume serving.
AWQ INT4: ~0.5 GB of weights per billion parameters. Slightly better quality than GPTQ. Emerging standard.
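As a rough check on the memory side of this trade-off, weight memory is approximately parameter count times bytes per parameter. The sketch below assumes 2 bytes for fp16, 1 for INT8, and 0.5 for INT4, and it covers weights only: KV cache, activations, and the scale and zero-point data that quantised formats store are not included. The function name and the 7B example are illustrative.

```python
# Approximate storage cost per parameter, in bytes (weights only).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params, precision):
    """Estimate weight memory in GB (1 GB = 1e9 bytes) for a given precision."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


# Example: a hypothetical 7-billion-parameter model.
estimates = {p: weight_memory_gb(7e9, p) for p in BYTES_PER_PARAM}
```

Treat the result as a lower bound when sizing GPUs, since serving frameworks reserve additional memory for the KV cache and runtime buffers.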

Step 2: Model versioning strategy

Fine-tuned models need versioning like software. Every production checkpoint should be tagged with: base model version, training data hash, key hyperparameters, eval results, and deployment date. Use a model registry (MLflow, Weights & Biases, Hugging Face Hub private repos) to store checkpoints and their metadata.
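One way to capture that metadata is a small immutable record serialised to JSON and stored next to the checkpoint. This is a sketch using only the standard library, not the schema of MLflow, Weights & Biases, or the Hugging Face Hub; the class, field names, and every value in the example are hypothetical placeholders.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


def hash_chunks(chunks):
    """SHA-256 over an iterable of byte chunks (e.g. a dataset file read in blocks)."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


@dataclass(frozen=True)
class ModelVersion:
    tag: str
    base_model: str
    training_data_sha256: str
    hyperparameters: dict
    eval_results: dict
    deployment_date: str  # ISO 8601 date

    def to_json(self):
        # sort_keys keeps the output stable so records diff cleanly.
        return json.dumps(asdict(self), sort_keys=True, indent=2)


# All values below are placeholders for illustration.
record = ModelVersion(
    tag="support-bot-v3",
    base_model="example-org/base-7b@rev-abc",
    training_data_sha256=hash_chunks([b"example training rows\n"]),
    hyperparameters={"lora_r": 16, "lora_alpha": 32, "learning_rate": 2e-4},
    eval_results={"exact_match": 0.81},
    deployment_date="2025-01-15",
)
metadata_json = record.to_json()
```

Hashing the training data rather than recording a filename means two checkpoints trained on silently different data cannot share the same metadata.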

Never overwrite a production model checkpoint. Always save the previous version before deploying a new one. When a fine-tuned model regresses in production, you need to roll back in minutes — not hours spent retraining.
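The no-overwrite rule can be enforced in code rather than by convention. The sketch below is an in-memory stand-in for a registry, with hypothetical class and method names: versions are write-once, promotion appends to a history, and rollback only moves a pointer back to the previous entry.

```python
class VersionStore:
    """In-memory stand-in for a model registry; versions are write-once."""

    def __init__(self):
        self._artifacts = {}  # version tag -> checkpoint location
        self._history = []    # tags promoted to production, oldest first

    def register(self, tag, location):
        if tag in self._artifacts:
            raise ValueError(f"version {tag!r} already exists; refusing to overwrite")
        self._artifacts[tag] = location

    def promote(self, tag):
        if tag not in self._artifacts:
            raise KeyError(f"unknown version {tag!r}")
        self._history.append(tag)

    @property
    def production(self):
        """Tag currently serving production, or None if nothing is promoted."""
        return self._history[-1] if self._history else None

    def rollback(self):
        """Point production back at the previously promoted version."""
        if len(self._history) < 2:
            raise RuntimeError("no previous production version to roll back to")
        self._history.pop()
        return self.production


# Usage with placeholder tags and paths.
store = VersionStore()
store.register("v1", "checkpoints/v1")
store.register("v2", "checkpoints/v2")
store.promote("v1")
store.promote("v2")
restored = store.rollback()
```

Because rollback never touches the stored artifacts, it completes in the time it takes to update a pointer, which is what makes a minutes-scale recovery realistic.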

Step 3: Serving infrastructure

Standard serving stacks for fine-tuned LLMs:

Step 4: Latency and throughput benchmarking

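Before routing production traffic, benchmark your fine-tuned model against your baseline on latency and throughput. Measure TTFT (time to first token), TPS (tokens per second), and p95/p99 latency under realistic load. A fine-tuned model with the same architecture and serving precision as the base model should have the same per-token latency, but confirm this, especially if you changed quantisation or model size. Check end-to-end latency separately: fine-tuning can change how long typical responses are, which shifts total request time even when per-token speed is unchanged.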
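The metric definitions can be pinned down in a few lines. The sketch below assumes the serving client exposes each response as an iterator of tokens (a hypothetical interface) and uses a nearest-rank percentile; the function names are illustrative. It measures one request at a time, so a realistic load test still needs a concurrent load generator driving it.

```python
import math
import time


def percentile(values, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def measure_request(token_stream, clock=time.perf_counter):
    """Consume a token iterator and record TTFT, total time, and tokens/sec."""
    start = clock()
    first_token_at = None
    n_tokens = 0
    for _ in token_stream:
        if first_token_at is None:
            first_token_at = clock()
        n_tokens += 1
    end = clock()
    if first_token_at is None:
        raise ValueError("stream produced no tokens")
    total = end - start
    return {
        "ttft_s": first_token_at - start,
        "total_s": total,
        "tokens": n_tokens,
        "tps": n_tokens / total if total > 0 else float("inf"),
    }


def summarise(results):
    """Aggregate per-request measurements into the headline numbers."""
    totals = [r["total_s"] for r in results]
    ttfts = [r["ttft_s"] for r in results]
    return {
        "ttft_p95_s": percentile(ttfts, 95),
        "latency_p95_s": percentile(totals, 95),
        "latency_p99_s": percentile(totals, 99),
        "mean_tps": sum(r["tps"] for r in results) / len(results),
    }
```

Run the same harness against the baseline and the fine-tuned model with identical prompts, then compare the two summaries rather than reading either in isolation.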

Step 5: Production monitoring

Step 6: Rollback plan

Define your rollback trigger before deployment: what metric, at what threshold, triggers an automatic or manual rollback to the previous model version. Have the previous checkpoint ready to serve with a config change, not a redeployment. Test the rollback procedure in staging before going live.
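Writing the trigger down as data makes it reviewable before launch. The sketch below is one possible shape, with hypothetical metric names and thresholds chosen only for illustration: each rule names a metric, a threshold, and the direction that counts as a breach.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackRule:
    metric: str
    threshold: float
    breach_when: str  # "above" or "below"

    def breached(self, value):
        if self.breach_when == "above":
            return value > self.threshold
        if self.breach_when == "below":
            return value < self.threshold
        raise ValueError(f"unknown direction {self.breach_when!r}")


def breached_rules(rules, live_metrics):
    """Return the rules whose metric is reported and past its threshold."""
    return [
        r for r in rules
        if r.metric in live_metrics and r.breached(live_metrics[r.metric])
    ]


# Placeholder rules and readings; pick real values from your own baseline.
RULES = [
    RollbackRule("p95_latency_s", 2.0, "above"),
    RollbackRule("eval_pass_rate", 0.90, "below"),
    RollbackRule("error_rate", 0.02, "above"),
]
live = {"p95_latency_s": 1.4, "eval_pass_rate": 0.86, "error_rate": 0.01}
to_rollback = breached_rules(RULES, live)
```

A non-empty result can either page a human or flip the serving config back to the previous version automatically. Note that this sketch ignores metrics that are missing from the readings; a stricter policy might treat a missing metric as a breach.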
