Evaluation 10 min read

LLM-as-Judge: Calibration, Bias Modes, and When to Trust It

Position bias, verbosity bias, self-consistency bias, sycophancy. A structured judging rubric with JSON output, measuring judge-human agreement with Cohen's Kappa, and the cross-family judging rule that reduces self-preference bias.

LLMs as Evaluators

LLM-as-judge (using one LLM to evaluate another's outputs) has become the dominant approach for evaluating generative AI systems. Human evaluation is expensive, slow, and doesn't scale to thousands of model iterations. LLM judges are cheap, fast, and surprisingly consistent, but they carry systematic biases of their own.

If you use LLM-as-judge without understanding its failure modes, you're not evaluating your model — you're evaluating your judge's preferences.

Known Biases in LLM Judges

Structured Judging Rubric

JUDGE_PROMPT_TEMPLATE = """
You are an expert evaluator. Assess the following response to the given question.
Do not consider which model produced it. Evaluate only the content.

Question: {question}

Response: {response}

Rate the response on each criterion from 1-5:

ACCURACY (1-5): Does it state correct information? Is anything false or misleading?
RELEVANCE (1-5): Does it address what was asked? Is anything irrelevant?
COMPLETENESS (1-5): Does it cover the key aspects? Are important aspects missing?
CLARITY (1-5): Is it well-organised and easy to understand?
CONCISENESS (1-5): Does it avoid unnecessary verbosity?

Output your ratings in this JSON format and nothing else:
{{
  "accuracy": <int 1-5>,
  "relevance": <int 1-5>,
  "completeness": <int 1-5>,
  "clarity": <int 1-5>,
  "conciseness": <int 1-5>,
  "brief_justification": "<one sentence per criterion, semicolon-separated>"
}}
"""

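import json, re

CRITERIA = ["accuracy", "relevance", "completeness", "clarity", "conciseness"]

def judge_response(question: str, response: str, judge_model) -> dict:
    """Score one response against the rubric; judge_model must expose generate(prompt) -> str."""
    prompt = JUDGE_PROMPT_TEMPLATE.format(question=question, response=response)
    raw = judge_model.generate(prompt)
    # Take the span from the first "{" to the last "}" in case the judge adds surrounding text.
    json_match = re.search(r'\{.*\}', raw, re.DOTALL)
    if not json_match:
        raise ValueError(f"Judge returned non-JSON: {raw[:200]}")
    scores = json.loads(json_match.group())
    # Composite is the unweighted mean of the five criteria.
    scores["composite"] = sum(scores[k] for k in CRITERIA) / len(CRITERIA)
    return scores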
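Judges do not always return clean JSON, and a single malformed reply should not abort an evaluation run. The sketch below shows one way to validate the five ratings and retry on failure. The names `parse_scores`, `judge_with_retry`, and `fake_generate` are illustrative and not part of the original code, and the retry count of 3 is an arbitrary assumption.

```python
import json
import re
from typing import Callable

CRITERIA = ("accuracy", "relevance", "completeness", "clarity", "conciseness")

def parse_scores(raw: str) -> dict:
    """Pull the span from the first '{' to the last '}' and validate the five 1-5 ratings."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if not match:
        raise ValueError("no JSON object found")
    scores = json.loads(match.group())  # json.JSONDecodeError is a ValueError subclass
    for key in CRITERIA:
        value = scores.get(key)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(value, int) or isinstance(value, bool) or not 1 <= value <= 5:
            raise ValueError(f"{key!r} must be an int in 1-5, got {value!r}")
    return scores

def judge_with_retry(generate: Callable[[str], str], prompt: str, max_attempts: int = 3) -> dict:
    """Call the judge up to max_attempts times and return the first reply that validates."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return parse_scores(generate(prompt))
        except ValueError as err:
            last_error = err
    raise ValueError(f"judge failed {max_attempts} times; last error: {last_error}")

# Hypothetical stand-in for a real model: malformed reply first, valid reply second.
_replies = iter([
    "Sure! Here are my ratings.",
    'Ratings: {"accuracy": 4, "relevance": 5, "completeness": 3, "clarity": 4, "conciseness": 2}',
])

def fake_generate(prompt: str) -> str:
    return next(_replies)

result = judge_with_retry(fake_generate, "any prompt")
print(result["accuracy"])
```

In a real pipeline, `generate` would wrap whatever client call your judge model uses. Logging the discarded replies is worth considering, since a high retry rate is itself a signal that the prompt or the judge needs attention.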

Calibrating Your Judge Against Human Labels

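Before trusting an LLM judge at scale, measure its agreement with human annotations on a calibration set. Take 200 items that humans have already rated and compute Cohen's Kappa between the human ratings and the judge ratings. Kappa corrects raw agreement for the agreement expected by chance, so it is a stricter test than simple match rate. Kappa > 0.6, conventionally read as substantial agreement, is the bar for using the judge in production evaluations.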

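import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import cohen_kappa_score

def calibrate_judge(judge_fn, human_ratings: list[dict]) -> dict:
    """
    judge_fn: callable (question, response) -> scores dict with a "composite" key,
              e.g. functools.partial(judge_response, judge_model=my_model).
    human_ratings: list of {"question", "response", "human_score"}, human_score an int 1-5.
    Returns calibration metrics.
    """
    judge_scores, human_scores = [], []

    for item in human_ratings:
        try:
            j = judge_fn(item["question"], item["response"])
            # Round the composite to the nearest integer so it is comparable to the human label.
            judge_scores.append(round(j["composite"]))
            human_scores.append(item["human_score"])
        except Exception as e:
            print(f"Judge failed on item: {e}")

    if len(judge_scores) < 2:
        raise ValueError("Need at least two judged items to compute agreement metrics")

    judge_arr, human_arr = np.array(judge_scores), np.array(human_scores)
    pearson,  _ = pearsonr(judge_arr,  human_arr)
    spearman, _ = spearmanr(judge_arr, human_arr)

    return {
        # Chance-corrected agreement on the 1-5 labels; this is the metric the 0.6 bar refers to.
        "cohen_kappa":    round(float(cohen_kappa_score(human_arr, judge_arr)), 4),
        "pearson_r":      round(float(pearson),  4),
        "spearman_rho":   round(float(spearman), 4),
        "mean_abs_error": round(float(np.abs(judge_arr - human_arr).mean()), 3),
        "n_judged":       len(judge_scores),
    }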
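To make the Kappa threshold concrete, it helps to see what the statistic computes. The sketch below implements unweighted Cohen's Kappa from its definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e is the agreement expected by chance from each rater's label frequencies. The function name `cohens_kappa` and the ten sample ratings are made up for illustration.

```python
from collections import Counter

def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Unweighted Cohen's Kappa: (p_o - p_e) / (1 - p_e)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: for each label, the product of the two raters' marginal rates.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # and this sketch chooses to report it as perfect agreement.
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Illustrative data only: ten items rated 1-5 by a human and by a judge.
human = [5, 4, 4, 2, 1, 3, 5, 2, 4, 3]
judge = [5, 4, 3, 2, 1, 3, 4, 2, 4, 4]
kappa = cohens_kappa(human, judge)
print(f"kappa = {kappa:.3f}")
```

Here the raters agree on 7 of 10 items (p_o = 0.70), chance agreement is 0.23, and Kappa comes out near 0.61, which only just clears the bar despite 70% raw agreement. Unweighted Kappa treats a 4-versus-5 disagreement the same as a 1-versus-5 one; for ordinal scales like this rubric, a weighted variant may be a better fit.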

The single most important practice: never use the same model family as judge and evaluatee. If you're evaluating GPT-4o outputs, use Claude as the judge. If you're evaluating Claude, use Gemini. Cross-family judging reduces self-preference bias dramatically.
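The cross-family rule is easy to enforce in code so that nobody pairs a model with a judge from its own family by accident. The following is a minimal sketch: the prefix-to-family table, the function names, and the candidate list are all hypothetical, and real model identifiers may need a more careful mapping.

```python
# Hypothetical mapping from model-name prefix to family; extend it for your own model list.
FAMILY_PREFIXES = {
    "gpt": "openai",
    "claude": "anthropic",
    "gemini": "google",
}

def model_family(model_name: str) -> str:
    """Look up the family of a model by its name prefix (case-insensitive)."""
    name = model_name.lower()
    for prefix, family in FAMILY_PREFIXES.items():
        if name.startswith(prefix):
            return family
    raise ValueError(f"unknown model family for {model_name!r}")

def pick_judge(evaluated_model: str, candidate_judges: list[str]) -> str:
    """Return the first candidate judge whose family differs from the evaluated model's."""
    family = model_family(evaluated_model)
    for judge in candidate_judges:
        if model_family(judge) != family:
            return judge
    raise ValueError(f"no cross-family judge available for {evaluated_model!r}")

# Candidate order encodes preference: the first cross-family match wins.
judges = ["claude", "gemini", "gpt-4o"]
print(pick_judge("gpt-4o", judges))  # claude
print(pick_judge("claude", judges))  # gemini
```

Raising an error when no cross-family judge exists, rather than silently falling back to a same-family one, keeps the rule from being bypassed unnoticed.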
