
LLM-as-Judge: Calibration, Bias Modes, and When to Trust It

Position bias, verbosity bias, self-preference bias, sycophancy. A structured judging rubric with JSON output, measuring judge-human agreement with Cohen's kappa, and the cross-family judging rule that reduces self-preference bias.

LLMs as Evaluators

LLM-as-judge (using one LLM to evaluate another's outputs) has become a standard approach for evaluating generative AI systems. Human evaluation is expensive and slow, and it doesn't scale to thousands of model iterations. LLM judges are cheap, fast, and fairly consistent, but they carry systematic biases of their own.

If you use LLM-as-judge without understanding its failure modes, you're not evaluating your model — you're evaluating your judge's preferences.

Known Biases in LLM Judges

Position bias: LLM judges prefer responses that appear first in a pairwise comparison, or responses in specific positions within a list. Mitigation: run each comparison twice with positions swapped; only count it as a win if the same response wins both times, and record a tie otherwise.
Verbosity bias: longer responses are rated higher regardless of quality. A verbose response that says less is preferred over a concise response that says more. Mitigation: limit response length in the prompt, or explicitly instruct the judge to evaluate information density.
Self-preference bias: when an LLM judges its own outputs against those of another model, it systematically prefers its own style. Mitigation: use a different model family for judging.
Sycophancy: if the prompt contains any hint of which response is 'better', the judge will align with that hint. Keep the prompt neutral; don't label responses A (human) vs B (model).
Anchoring: the judge's rating of the second response in a sequence is anchored to the first, so it rates relative to context rather than on an absolute scale. Mitigation: score each response in its own call against a fixed rubric.
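
The position-swap mitigation above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the `compare` callable and its "first"/"second" return values are assumptions made for the example, and the two stub judges exist only to show the behaviour.

```python
from typing import Callable

# Hypothetical judge signature: compare(question, first, second) returns
# "first" or "second" for whichever response it prefers in that order.
Compare = Callable[[str, str, str], str]

def debiased_pairwise(question: str, resp_a: str, resp_b: str, compare: Compare) -> str:
    """Return "a", "b", or "tie" after judging both orderings."""
    # Round 1: A is shown first.
    r1 = "a" if compare(question, resp_a, resp_b) == "first" else "b"
    # Round 2: B is shown first, so "first" now means B.
    r2 = "b" if compare(question, resp_b, resp_a) == "first" else "a"
    # Only a verdict that survives the swap counts as a win.
    return r1 if r1 == r2 else "tie"

# Stub judges for illustration only.
def always_first(question, first, second):
    return "first"  # pure position bias

def prefers_longer(question, first, second):
    return "first" if len(first) >= len(second) else "second"  # pure verbosity bias

print(debiased_pairwise("q", "short", "a longer answer", always_first))    # tie
print(debiased_pairwise("q", "short", "a longer answer", prefers_longer))  # b
```

The swap turns a purely position-driven verdict into a tie, but the length-driven judge still picks the longer response both times. Swapping only addresses position bias; verbosity bias needs its own mitigation.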

Structured Judging Rubric

JUDGE_PROMPT_TEMPLATE = """
You are an expert evaluator. Assess the following response to the given question.
Do not consider which model produced it. Evaluate only the content.

Question: {question}

Response: {response}

Rate the response on each criterion from 1-5:

ACCURACY (1-5): Does it state correct information? Is anything false or misleading?
RELEVANCE (1-5): Does it address what was asked? Is anything irrelevant?
COMPLETENESS (1-5): Does it cover the key aspects? Are important aspects missing?
CLARITY (1-5): Is it well-organised and easy to understand?
CONCISENESS (1-5): Does it avoid unnecessary verbosity?

Output your ratings in this JSON format and nothing else:
{{
  "accuracy": <int 1-5>,
  "relevance": <int 1-5>,
  "completeness": <int 1-5>,
  "clarity": <int 1-5>,
  "conciseness": <int 1-5>,
  "brief_justification": "<one sentence per criterion, semicolon-separated>"
}}
"""

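import json
import re

CRITERIA = ["accuracy", "relevance", "completeness", "clarity", "conciseness"]

def judge_response(question: str, response: str, judge_model) -> dict:
    # judge_model is any object exposing generate(prompt) -> str.
    prompt = JUDGE_PROMPT_TEMPLATE.format(question=question, response=response)
    raw = judge_model.generate(prompt)
    # Grab the outermost {...} span in case the judge wraps the JSON in extra text.
    json_match = re.search(r"\{.*\}", raw, re.DOTALL)
    if not json_match:
        raise ValueError(f"Judge returned non-JSON: {raw[:200]}")
    scores = json.loads(json_match.group())
    # Reject missing or out-of-range ratings instead of averaging them silently.
    for k in CRITERIA:
        if not isinstance(scores.get(k), int) or not 1 <= scores[k] <= 5:
            raise ValueError(f"Invalid score for {k!r}: {scores.get(k)!r}")
    scores["composite"] = sum(scores[k] for k in CRITERIA) / len(CRITERIA)
    return scores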

Calibrating Your Judge Against Human Labels

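Before trusting an LLM judge at scale, measure its agreement with human annotations on a calibration set. Take 200 items that humans have rated and compute Cohen's kappa between the human ratings and the judge ratings. A common rule of thumb treats kappa above 0.6 as substantial agreement, which makes it a reasonable bar for using the judge in production evaluations. Because the scores are ordinal, it is also worth checking rank correlation and mean absolute error alongside kappa.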
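
For reference, Cohen's kappa compares observed agreement p_o with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes the unweighted version from scratch so the arithmetic is visible. The `human` and `judge` lists are made-up toy ratings, not real calibration data.

```python
from collections import Counter

def cohen_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Unweighted Cohen's kappa for two equal-length lists of labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Need two non-empty lists of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement, from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Toy ratings for illustration only.
human = [5, 4, 4, 2, 1, 3, 5, 2]
judge = [5, 4, 3, 2, 1, 3, 4, 2]
print(round(cohen_kappa(human, judge), 3))  # 0.686
```

Here the raters agree on 6 of 8 items (p_o = 0.75) and chance agreement is 13/64, giving kappa of about 0.686. Unweighted kappa treats a 4-vs-5 disagreement the same as a 1-vs-5 one; for ordinal 1-5 scores a weighted variant may be a better fit.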

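import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import cohen_kappa_score

def calibrate_judge(judge_fn, human_ratings: list[dict]) -> dict:
    """
    human_ratings: list of {question, response, human_score}, where
    human_score is an integer from 1 to 5.
    judge_fn(question, response) returns a dict with a "composite" score,
    e.g. functools.partial(judge_response, judge_model=my_judge), where
    my_judge is a placeholder for your own judge client.
    Returns calibration metrics.
    """
    judge_scores, human_scores = [], []

    for item in human_ratings:
        try:
            j = judge_fn(item["question"], item["response"])
        except Exception as e:
            # Skip items the judge fails on so the two lists stay aligned.
            print(f"Judge failed on item: {e}")
            continue
        judge_scores.append(round(j["composite"]))
        human_scores.append(item["human_score"])

    pearson, _ = pearsonr(judge_scores, human_scores)
    spearman, _ = spearmanr(judge_scores, human_scores)
    # Unweighted kappa counts exact matches only; the quadratic-weighted variant
    # (an addition here) penalises a 1-vs-5 disagreement more than a 3-vs-4 one.
    kappa = cohen_kappa_score(human_scores, judge_scores)
    kappa_w = cohen_kappa_score(human_scores, judge_scores, weights="quadratic")
    mae = np.abs(np.array(judge_scores) - np.array(human_scores)).mean()

    return {
        "cohen_kappa":           round(float(kappa), 4),
        "cohen_kappa_quadratic": round(float(kappa_w), 4),
        "pearson_r":             round(float(pearson), 4),
        "spearman_rho":          round(float(spearman), 4),
        "mean_abs_error":        round(float(mae), 3),
        "n_judged":              len(judge_scores),
    }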

The single most important practice: never use the same model family for the judge and the model under evaluation. If you're evaluating GPT-4o outputs, use Claude as the judge. If you're evaluating Claude, use Gemini. Cross-family judging targets self-preference bias directly; the other biases above still need their own mitigations.
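
One way to enforce the cross-family rule in an evaluation harness is a small lookup that refuses to pair a model with a judge from its own family. The sketch below is an illustration under assumptions: the model identifiers, the `MODEL_FAMILY` table, and `pick_judge` are hypothetical names, not part of any library.

```python
# Hypothetical model identifiers and family labels, for illustration only.
MODEL_FAMILY = {
    "gpt-4o": "openai",
    "claude": "anthropic",
    "gemini": "google",
}

def pick_judge(evaluated_model: str, candidate_judges: list[str]) -> str:
    """Return the first candidate judge from a different family than the evaluated model."""
    family = MODEL_FAMILY[evaluated_model]
    for judge in candidate_judges:
        if MODEL_FAMILY[judge] != family:
            return judge
    raise ValueError(f"No cross-family judge available for {evaluated_model!r}")

print(pick_judge("gpt-4o", ["gpt-4o", "claude", "gemini"]))  # claude
print(pick_judge("claude", ["claude", "gemini"]))            # gemini
```

Raising an error when no cross-family judge is available keeps the harness from silently falling back to same-family judging.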
