
Hamel Husain: Evals Are Everything

Hamel wrote the definitive guide to LLM evals. His core thesis: if you don't have evals, you don't have a product. His guide is required reading before you ship anything.

Who He Is

Hamel Husain is an ML engineer and independent consultant who has worked with GitHub, Airbnb, and a range of AI startups. He is best known for writing the most practical, opinionated guide to LLM evaluation that exists — and for his work on LLM fine-tuning in production, including contributions to the FastAI ecosystem.

Core Thesis

If you don't have evals, you don't have a product. Everything else — prompting, fine-tuning, RAG — is secondary to knowing whether your system works.

Key Themes

Evals as the primary engineering discipline — not an afterthought, but the foundation of everything else
Domain-specific data beats general models — a small, well-curated fine-tuning dataset beats a generic SOTA model for narrow tasks
LLM-as-judge has known failure modes — evaluator bias, reference collapse, metric hacking
Production fine-tuning workflow — data collection → cleaning → SFT → eval → iterate, not a one-shot process
The 'your AI product needs evals' thesis — you can't improve what you can't measure
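
To make "you can't improve what you can't measure" concrete, here is a minimal sketch of an eval harness: a set of named test cases, each with a programmatic check, run against the system to produce a pass rate and a list of failures. This is an illustration under stated assumptions, not code from Hamel's guide. The names `EvalCase`, `run_evals`, and `fake_system` are hypothetical, and `fake_system` stands in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


def run_evals(system: Callable[[str], str], cases: list[EvalCase]) -> dict:
    """Run every case through the system; report the pass rate and the failures."""
    failures = []
    for case in cases:
        output = system(case.prompt)
        if not case.check(output):
            failures.append((case.name, output))
    passed = len(cases) - len(failures)
    return {
        "pass_rate": passed / len(cases) if cases else 0.0,
        "failures": failures,
    }


# Hypothetical stand-in for a real LLM call (assumption, for illustration only).
def fake_system(prompt: str) -> str:
    if "refund" in prompt.lower():
        return "Refund issued."
    return "I cannot help with that."


# Illustrative cases; real ones would come from inspecting production data.
cases = [
    EvalCase("mentions_refund", "Customer asks for a refund",
             lambda out: "refund" in out.lower()),
    EvalCase("no_apology_loop", "Customer says hello",
             lambda out: out.lower().count("sorry") < 2),
    EvalCase("greets_user", "Customer says hello",
             lambda out: "hello" in out.lower()),
]

report = run_evals(fake_system, cases)
print(f"pass rate: {report['pass_rate']:.2f}")
for name, output in report["failures"]:
    print(f"FAILED {name}: {output!r}")
```

Re-running the same cases after every prompt, retrieval, or fine-tuning change is what turns the workflow above into an iterative loop: the pass rate gives a number to compare across iterations, and the failure list shows what to inspect next.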

Essential Reading

Resource	Format	Why It Matters
Your AI Product Needs Evals	Blog post	The best single piece on why evaluation is the core discipline of AI product engineering.
A Practical Guide to LLM Evals	hamel.dev	Step-by-step: what to measure, how to set up an eval harness, LLM-as-judge pitfalls.
Fine-tuning in Practice	Blog series	Real workflows: dataset curation, SFT, DPO, evaluation — not theoretical pipelines.
nbdev (FastAI)	Open-source	Notebook-driven development — his preferred environment for rapid ML experimentation.
hamel.dev	Blog	Ongoing: practical posts on what actually works in production ML, no hype.
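
One way to probe the LLM-as-judge pitfalls these resources cover is to compare a judge's verdicts with human labels on the same outputs. The sketch below assumes binary pass/fail labels. The function name `judge_agreement` and the sample labels are hypothetical and made up for illustration; they are not results from any real system.

```python
def judge_agreement(human: list[bool], judge: list[bool]) -> dict:
    """Compare LLM-judge verdicts against human labels for the same outputs."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")

    pairs = list(zip(human, judge))
    agree = sum(h == j for h, j in pairs)

    # Split the judge's verdicts by what the human said, to expose one-sided bias.
    judge_on_human_pass = [j for h, j in pairs if h]
    judge_on_human_fail = [j for h, j in pairs if not h]

    def rate(verdicts: list[bool]):
        return sum(verdicts) / len(verdicts) if verdicts else None

    return {
        "agreement": agree / len(pairs),
        # Should be near 1.0: the judge passes what humans pass.
        "judge_pass_rate_on_human_pass": rate(judge_on_human_pass),
        # Should be near 0.0: a high value means the judge is too lenient.
        "judge_pass_rate_on_human_fail": rate(judge_on_human_fail),
    }


# Made-up labels for six outputs (True = pass), for illustration only.
human_labels = [True, True, False, False, True, False]
judge_labels = [True, True, True, False, True, True]

stats = judge_agreement(human_labels, judge_labels)
for key, value in stats.items():
    print(f"{key}: {value:.2f}")
```

With these sample labels the judge agrees with the humans on 4 of 6 outputs, but it passes 2 of the 3 outputs that humans failed. Raw agreement alone would hide that leniency, which is why the sketch reports the two conditional rates separately.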

What to Question

Hamel's emphasis on domain-specific fine-tuning is well-placed but sometimes understates how far a well-prompted frontier model can go without fine-tuning. His eval frameworks are opinionated — they work best in structured task settings and require more adaptation for open-ended generation tasks.
