Hamel Husain: Evals Are Everything
Hamel wrote the definitive guide to LLM evals. His core thesis: if you don't have evals, you don't have a product. His guide is required reading before you ship anything.
Who He Is
Hamel Husain is an ML engineer and independent consultant who has worked with GitHub, Airbnb, and a range of AI startups. He is best known for his practical, opinionated writing on LLM evaluation and for his work on fine-tuning LLMs in production, including contributions to the fast.ai ecosystem.
Core Thesis
If you don't have evals, you don't have a product. Everything else — prompting, fine-tuning, RAG — is secondary to knowing whether your system works.
Key Themes
- Evals as the primary engineering discipline — not an afterthought, but the foundation of everything else
- Domain-specific data beats general models: on narrow tasks, a model fine-tuned on a small, well-curated dataset can beat a generic SOTA model
- LLM-as-judge has known failure modes — evaluator bias, reference collapse, metric hacking
- Production fine-tuning workflow — data collection → cleaning → SFT → eval → iterate, not a one-shot process
- Measurement before optimization: you can't improve what you can't measure, which is the argument behind "Your AI Product Needs Evals"
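To make the first and third themes concrete, here is a minimal sketch of an eval harness in Python. It is an illustration under stated assumptions, not Hamel's own tooling: `EvalCase`, `run_evals`, `pass_rate`, `judge_agreement`, and `fake_model` are hypothetical names, and `fake_model` stands in for a real LLM call. It shows two ideas: assertion-style checks that yield a pass rate you can track across iterations, and a simple agreement score for checking an LLM judge against human labels before trusting it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    """One test case: a prompt plus a check on the model's output (hypothetical type)."""
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


def run_evals(generate: Callable[[str], str], cases: List[EvalCase]) -> Dict[str, bool]:
    """Run every case through `generate` and record pass/fail per case."""
    return {case.name: bool(case.check(generate(case.prompt))) for case in cases}


def pass_rate(results: Dict[str, bool]) -> float:
    """Fraction of cases that passed; 0.0 when there are no cases."""
    return sum(results.values()) / len(results) if results else 0.0


def judge_agreement(judge_labels: List[bool], human_labels: List[bool]) -> float:
    """Fraction of examples where the LLM judge's verdict matches the human label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("judge_labels and human_labels must have the same length")
    if not human_labels:
        return 0.0
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def fake_model(prompt: str) -> str:
    """Assumption: a stand-in for a real LLM call, so the sketch runs offline."""
    if "refund" in prompt.lower():
        return "Refund issued for order 1234."
    return "I am not sure."


# Hypothetical cases for a customer-support assistant.
CASES = [
    EvalCase("mentions_order_id", "Please refund order 1234",
             lambda out: "1234" in out),
    EvalCase("no_hedging", "What is your return policy?",
             lambda out: "not sure" not in out.lower()),
]

if __name__ == "__main__":
    results = run_evals(fake_model, CASES)
    print(results)
    print("pass rate:", pass_rate(results))
    # Made-up labels, for illustration only: judge verdicts vs. human verdicts.
    print("judge agreement:",
          judge_agreement([True, False, True, True], [True, False, False, True]))
```

In practice you would swap `fake_model` for your real model call, grow `CASES` from failures observed in production, and rerun the harness after every prompt or fine-tuning change. A low `judge_agreement` score is a signal to fix the judge prompt before relying on its verdicts.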
Essential Reading
| Resource | Format | Why It Matters |
|---|---|---|
| Your AI Product Needs Evals | Blog post | The best single piece on why evaluation is the core discipline of AI product engineering. |
| A Practical Guide to LLM Evals | Guide (hamel.dev) | Step-by-step: what to measure, how to set up an eval harness, LLM-as-judge pitfalls. |
| Fine-tuning in Practice | Blog series | Real workflows: dataset curation, SFT, DPO, evaluation — not theoretical pipelines. |
| nbdev (fast.ai) | Open-source | Notebook-driven development, his preferred environment for rapid ML experimentation. |
| hamel.dev | Blog | Ongoing: practical posts on what actually works in production ML, no hype. |
What to Question
Hamel's emphasis on domain-specific fine-tuning is well-placed but sometimes understates how far a well-prompted frontier model can go without fine-tuning. His eval frameworks are opinionated — they work best in structured task settings and require more adaptation for open-ended generation tasks.