
Perf Calibration for AI Teams: Rubrics, Avoiding Bias, and the Calibration Session

Recency bias and visibility bias are the dominant failure modes in AI team calibration. This article covers a behavioral rubric with three dimensions (technical judgment, execution reliability, organizational impact) and how to run the calibration session with evidence, not argument.

Calibration sessions are where performance ratings get set across the org. For AI teams, they're uniquely fraught: the work is hard to observe, outcomes are uncertain, and 'impact' in ML is easy to over-attribute to the loudest person in the room.

The Calibration Biases Specific to AI Teams

Recency bias: the engineer who fixed the production incident last month overshadows the one who quietly improved retrieval precision by 4 points over the quarter.

Visibility bias: the engineer presenting at the all-hands gets rated higher than the one doing the foundational data work no one sees.

Outcome vs. process bias: a shipped model improvement looks better than a rigorous negative result, even though the rigor is the real signal.

Project assignment bias: engineers on high-visibility projects look better than equally talented engineers on maintenance work.

Before the calibration session, audit your own ratings. For each engineer, ask: am I rating their capability or their visibility? Am I rating the outcome or the process that produced it?

A Rubric That Works for AI Work

Abstract rubrics ('exceeds expectations') fail for AI work because they don't account for the research-vs-engineering spectrum. Use a behavioral rubric with three dimensions: technical judgment, execution reliability, and organizational impact.
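One way to keep the rubric behavioral is to record concrete observations against each dimension and check which dimensions have no evidence yet. The sketch below is only an illustration: the three dimension names come from the rubric above, while the class names, fields, and the `uncovered` helper are hypothetical and not part of any standard tool.

```python
from dataclasses import dataclass, field

# The three dimensions are from the article; every other name in this
# block is a hypothetical illustration.
DIMENSIONS = (
    "technical_judgment",
    "execution_reliability",
    "organizational_impact",
)


@dataclass
class Observation:
    dimension: str  # must be one of DIMENSIONS
    behavior: str   # what the engineer actually did, stated concretely

    def __post_init__(self):
        if self.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {self.dimension}")


@dataclass
class RubricRecord:
    engineer: str
    observations: list[Observation] = field(default_factory=list)

    def coverage(self) -> dict[str, int]:
        """Count observations per dimension, including dimensions with none."""
        counts = {d: 0 for d in DIMENSIONS}
        for obs in self.observations:
            counts[obs.dimension] += 1
        return counts

    def uncovered(self) -> list[str]:
        """Dimensions that have no behavioral evidence recorded yet."""
        return [d for d, n in self.coverage().items() if n == 0]


record = RubricRecord("engineer_a")
record.observations.append(
    Observation("technical_judgment", "Reported a rigorous negative result instead of shipping an unvalidated change")
)
print(record.uncovered())
```

A rating proposed while `uncovered()` is non-empty rests on at most two of the three dimensions, which is worth knowing before the session rather than during it.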

Running the Calibration Session

The calibration session is not a debate. It is a structured process for surfacing evidence and resolving disagreements with shared criteria. The manager who comes in without written evidence for each rating loses the session to whoever argues loudest.

Prepare a one-pager per engineer: 3 behavioral examples, 1 growth area, proposed rating with rationale.

Start with agreed cases (obvious exceeds, obvious meets) to build momentum.

Challenge ratings that rely on anecdote rather than pattern.

For disputed cases, ask 'what evidence would change your rating?' If neither side can answer this, the evidence is insufficient.

End with action: who gets what feedback, by when, from whom.
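The preparation steps above can be sketched as a small pre-session check. This is a minimal illustration, assuming the one-pager is kept as structured data; `OnePager`, `readiness_gaps`, and `agenda` are hypothetical names, and the only rules encoded are the ones stated above (3 behavioral examples, 1 growth area, a proposed rating with rationale, agreed cases first).

```python
from dataclasses import dataclass


@dataclass
class OnePager:
    engineer: str
    behavioral_examples: list[str]
    growth_area: str
    proposed_rating: str
    rationale: str


def readiness_gaps(page: OnePager) -> list[str]:
    """Return what is missing before this one-pager can go to calibration."""
    gaps = []
    # Blank strings do not count as examples.
    examples = [e for e in page.behavioral_examples if e.strip()]
    if len(examples) < 3:
        gaps.append(f"needs 3 behavioral examples, has {len(examples)}")
    if not page.growth_area.strip():
        gaps.append("missing growth area")
    if not page.proposed_rating.strip():
        gaps.append("missing proposed rating")
    if not page.rationale.strip():
        gaps.append("missing rationale")
    return gaps


def agenda(pages: list[OnePager], disputed: set[str]) -> list[str]:
    """Order the session: agreed cases first, disputed cases last."""
    # sorted() is stable, so engineers keep their original order within each group.
    return [p.engineer for p in sorted(pages, key=lambda p: p.engineer in disputed)]


pages = [
    OnePager("engineer_b", ["example 1", ""], "", "exceeds", "strong quarter"),
    OnePager("engineer_a", ["example 1", "example 2", "example 3"], "scoping", "meets", "consistent delivery"),
]
for page in pages:
    print(page.engineer, readiness_gaps(page))
print(agenda(pages, disputed={"engineer_b"}))
```

The check says nothing about whether the examples are good evidence. It only catches the case where a manager arrives with a rating and no written support for it.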

The Promotion Decision

Promotions should be boring if you've managed well: the engineer is already performing at the next level, and the calibration session just makes it official. Surprise promotions and surprise non-promotions are both management failures.

Tell engineers explicitly when they are on a promotion trajectory and what evidence would confirm it.

Tell engineers explicitly when they are not, and why. Do this before the calibration session, not after.

Don't nominate engineers for promotion without the behavioral evidence to defend it in the session.

The hardest conversation: the senior engineer who will never make staff at your company. Have it early.
