
Perf Calibration for AI Teams: Rubrics, Avoiding Bias, and the Calibration Session

Recency bias and visibility bias are the dominant failure modes in AI team calibration. This article covers a behavioral rubric with three dimensions (technical judgment, execution reliability, organizational impact) and how to run the calibration session with evidence, not argument.

Calibration sessions are where performance ratings get set across the org. For AI teams, they're uniquely fraught: the work is hard to observe, outcomes are uncertain, and 'impact' in ML is easy to over-attribute to the loudest person in the room.

The Calibration Biases Specific to AI Teams

Recency bias and visibility bias are the dominant failure modes, so audit your own ratings before the calibration session. For each engineer, ask: am I rating their capability or their visibility? Am I rating the outcome or the process that produced it?

A Rubric That Works for AI Work

Abstract rubrics ('exceeds expectations') fail for AI work because they don't account for the research-vs-engineering spectrum. Use a behavioral rubric with three dimensions: technical judgment, execution reliability, and organizational impact.
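One way to keep the rubric behavioral is to record observed behaviors under each of the three dimensions rather than a single label. The sketch below is a minimal illustration, not a prescribed tool: the `RubricEntry` class, its field names, and the sample behavior string are all hypothetical.

```python
from dataclasses import dataclass, field

# The three dimensions named in the rubric above.
DIMENSIONS = (
    "technical_judgment",
    "execution_reliability",
    "organizational_impact",
)


@dataclass
class RubricEntry:
    """Hypothetical record of written, observed behaviors for one engineer."""

    engineer: str
    # Maps a dimension name to the list of behaviors observed for it.
    behaviors: dict[str, list[str]] = field(default_factory=dict)

    def missing_dimensions(self) -> list[str]:
        """Return the dimensions that have no recorded behavior yet."""
        return [d for d in DIMENSIONS if not self.behaviors.get(d)]

    def is_complete(self) -> bool:
        """True when every dimension has at least one recorded behavior."""
        return not self.missing_dimensions()


# Illustrative usage with made-up data.
entry = RubricEntry(
    engineer="engineer-a",
    behaviors={"technical_judgment": ["Chose an offline eval before shipping"]},
)
print(entry.missing_dimensions())
```

An entry that still has missing dimensions is a signal that the rating rests on an overall impression rather than on behaviors the manager can point to.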

Running the Calibration Session

The calibration session is not a debate. It is a structured process for surfacing evidence and resolving disagreements with shared criteria. The manager who comes in without written evidence for each rating loses the session to whoever argues loudest.

The Promotion Decision

Promotions should be boring if you've managed well: the engineer is already performing at the next level, and the calibration session just makes it official. Surprise promotions and surprise non-promotions are both management failures.
