What Does an ML Engineer Actually Do in 2025?
The evolving ML Engineer role post-LLM revolution — what's changed, what's still core (training, MLOps, model serving), and how to position yourself.
The ML Engineer title covers a wide range, from building training pipelines for billion-parameter models to deploying fine-tuned classifiers in production microservices. Before you apply, it helps to understand what the role actually involves, how it differs from AI Engineer and Data Scientist, and what the career path looks like.
What ML engineers actually do
ML Engineers sit at the intersection of software engineering and machine learning research. They write production code, but the code trains and serves models. Day-to-day work includes: building and maintaining training pipelines, curating and versioning training datasets, running experiments and tracking results, deploying models to serving infrastructure, monitoring model performance in production, and collaborating with researchers to productionise new techniques.
The distinction from Data Scientists: ML Engineers own the production path. A Data Scientist builds a model in a notebook; an ML Engineer turns it into a service that handles 10K requests per minute, fails gracefully, and can be retrained and redeployed in an hour.
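As a rough illustration of what "fails gracefully" can mean at the code level, the sketch below wraps a model call so that an exception yields a default prediction instead of a failed request. The names `predict_with_fallback` and `toy_model` are hypothetical, and a real service would also add timeouts, logging, and metrics.

```python
from typing import Callable, Sequence, Tuple


def predict_with_fallback(
    model_fn: Callable[[Sequence[float]], float],
    features: Sequence[float],
    fallback: float,
) -> Tuple[float, bool]:
    """Return (prediction, used_fallback).

    If the model raises, return the fallback value instead of propagating the error.
    """
    try:
        return model_fn(features), False
    except Exception:
        # A production service would log the error and emit a metric here.
        return fallback, True


def toy_model(features: Sequence[float]) -> float:
    # Hypothetical stand-in for a model: fails on empty input, else returns the mean.
    if not features:
        raise ValueError("empty feature vector")
    return sum(features) / len(features)


print(predict_with_fallback(toy_model, [1.0, 2.0, 3.0], fallback=0.0))  # (2.0, False)
print(predict_with_fallback(toy_model, [], fallback=0.0))               # (0.0, True)
```

Returning the `used_fallback` flag alongside the value lets the caller count degraded responses, which is usually the first signal that something upstream has broken.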
ML Engineer vs AI Engineer — the 2025 distinction
| Dimension | ML Engineer | AI Engineer |
|---|---|---|
| Primary work | Training + fine-tuning models | Building on top of foundation models |
| Core skill | PyTorch / JAX, distributed training | Prompt engineering, RAG, agents, evals |
| Output | Model weights + serving infrastructure | LLM-powered applications |
| Infra depth | Deep — owns GPUs, distributed systems | Moderate — uses managed APIs |
| Math depth | High — loss functions, gradients | Moderate — uses models as black boxes |
| 2025 demand | High at labs and large tech | Rapidly growing across all sectors |
Core technical skills
- Python at a production level — not just scripts, but services with tests, types, and CI
- PyTorch or JAX — building, training, and debugging neural networks from scratch
- Distributed training — data parallelism, model parallelism, FSDP, DeepSpeed
- ML infrastructure — experiment tracking (MLflow, W&B), model registry, artifact storage
- Data pipelines — building reliable, reproducible data processing at scale
- Model serving — TorchServe, ONNX, TensorRT, vLLM, or Triton Inference Server
- Cloud ML platforms — SageMaker, Vertex AI, or Azure ML for managed training jobs
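
To make the data parallelism item above concrete, here is a minimal pure-Python sketch of the core idea: each worker computes a gradient on its own shard of the batch, and the results are averaged. The one-parameter model and the function names are assumptions chosen for illustration; frameworks such as PyTorch DDP or FSDP do the averaging with an all-reduce across devices.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y)


def mse_gradient(w: float, batch: List[Sample]) -> float:
    """Gradient of mean squared error with respect to w for the model y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_gradient(w: float, batch: List[Sample], num_workers: int) -> float:
    """Split the batch into equal shards, compute one gradient per shard, then average."""
    if len(batch) % num_workers != 0:
        raise ValueError("this sketch assumes the batch divides evenly across workers")
    shard_size = len(batch) // num_workers
    shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(num_workers)]
    # In a real system each shard gradient is computed on a separate device.
    per_worker = [mse_gradient(w, shard) for shard in shards]
    # Averaging stands in for the all-reduce step.
    return sum(per_worker) / num_workers


batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
full = mse_gradient(0.5, batch)
sharded = data_parallel_gradient(0.5, batch, num_workers=2)
print(full, sharded)  # -22.5 -22.5
```

With equal shard sizes, the averaged shard gradients match the full-batch gradient, which is why data parallelism can scale the batch across devices without changing the optimisation step.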
What companies want in 2025
Pre-2022, most ML engineering roles focused on classical models: tabular data, recommendation systems, NLP classifiers. Post-2022, most new ML engineering hiring is LLM-adjacent: fine-tuning foundation models, building RLHF pipelines, scaling infrastructure for frontier model training, or deploying and serving large models efficiently.
The most in-demand specialisations: LLM fine-tuning (LoRA, QLoRA, full fine-tune at scale), inference optimisation (quantisation, speculative decoding, vLLM deployment), and training infrastructure (GPU cluster management, distributed training debugging).
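A quick back-of-the-envelope calculation shows why LoRA is attractive for fine-tuning. Instead of updating a full `d_out x d_in` weight matrix, LoRA trains a low-rank update `B @ A`, where `B` is `d_out x r` and `A` is `r x d_in`. The layer sizes below are illustrative assumptions, not figures from any particular model.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when updating a d_out x d_in weight matrix directly."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a rank-r update B @ A (B: d_out x r, A: r x d_in)."""
    return rank * (d_out + d_in)


# Illustrative sizes only.
d_out, d_in, rank = 4096, 4096, 8
full = full_finetune_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
print(full, lora, lora / full)  # 16777216 65536 0.00390625
```

For this single matrix, the rank-8 adapter trains about 0.4% of the parameters that a full fine-tune would.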
Career progression
| Level | Scope | Key milestone |
|---|---|---|
| Junior MLE | Executes well-defined tasks on existing pipelines | Ships first model to production |
| Mid MLE | Owns a model or pipeline end-to-end | Reduces training time or serving cost by 2× |
| Senior MLE | Leads cross-functional ML projects | Designs the ML architecture for a new product |
| Staff MLE | Sets technical direction for an ML platform or area | Influences multiple teams or products |
| Principal MLE | Org-level impact on ML strategy | Drives multi-year technical roadmap |
How to get in
The clearest path from SWE to MLE: build a project that requires training a model from scratch — not fine-tuning an existing one. Build the data pipeline, write the training loop, deploy the model, and monitor it. Show this project in interviews. Complement it with a strong understanding of transformers, backpropagation, and distributed systems.
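For a sense of what "write the training loop" means at its smallest scale, the sketch below fits a line to synthetic data with hand-derived gradients and plain gradient descent. The data, learning rate, and step count are assumptions picked so the example converges quickly; a real project would use PyTorch or JAX autograd, but the forward, backward, and update steps are the same.

```python
import random
from typing import Tuple


def train(steps: int = 500, lr: float = 0.1, seed: int = 0) -> Tuple[float, float, float]:
    """Fit y = w * x + b by gradient descent on mean squared error."""
    rng = random.Random(seed)
    # Synthetic, noise-free data from the line y = 3x + 1.
    xs = [rng.uniform(-1.0, 1.0) for _ in range(64)]
    data = [(x, 3.0 * x + 1.0) for x in xs]

    w, b, loss = 0.0, 0.0, float("inf")
    for _ in range(steps):
        # Forward pass: predictions and per-sample errors.
        errors = [(w * x + b) - y for x, y in data]
        loss = sum(e * e for e in errors) / len(data)
        # Backward pass: gradients of the loss, derived by hand.
        grad_w = sum(2.0 * e * x for e, (x, _) in zip(errors, data)) / len(data)
        grad_b = sum(2.0 * e for e in errors) / len(data)
        # Parameter update.
        w -= lr * grad_w
        b -= lr * grad_b
    # The returned loss is the value measured just before the final update.
    return w, b, loss


w, b, loss = train()
print(round(w, 3), round(b, 3))  # 3.0 1.0
```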
The Karpathy path: watch 'Let's build GPT: from scratch, in code, spelled out', implement it yourself, then train a GPT-2-style model on a small dataset. This project, described confidently in interviews, opens more MLE doors than any certification.