LLM-as-Judge: The Four Biases Your Evaluator Won't Tell You About
LLM judges are fast, scalable, and wrong in predictable ways. Self-preference bias, position bias, length bias, and verbosity bias — how each one inflates your eval scores and the calibration techniques that compensate for them.
The evaluator that grades its own homework
LLM-as-judge has become the default eval approach for teams building with LLMs. It is cheap, fast, and scalable. It is also wrong in predictable ways that most teams have never measured. The biases are not random — they are systematic, they compound over evaluation cycles, and they cause teams to ship models that score well and perform poorly.
There are four biases that appear consistently across models and tasks. Each one inflates scores in ways that do not reflect real quality. None of them disappears at larger model scale — GPT-4o as a judge has all four biases. Knowing what they are and how to compensate for them is the difference between an eval that catches regressions and one that masks them.
Bias 1: Self-preference
When you use an LLM as a judge and the candidate response was also generated by an LLM, the judge exhibits a preference for outputs that match its own generation style. GPT-4o scoring GPT-4o responses gives systematically higher scores than GPT-4o scoring Claude responses — not because the GPT-4o responses are better, but because the judge recognises its own patterns.
This is particularly damaging when the judge and the model under evaluation are from the same family or the same provider. A team using GPT-4 to generate responses and GPT-4o to evaluate them is measuring how well the model mimics itself, not how well it serves users.
Self-preference bias is largest when the judge and candidate share architecture. Use a different model family as your judge — Claude judging GPT-4o outputs, or GPT-4o judging Claude outputs. Cross-family judging is not perfect but substantially reduces this bias.
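The cross-family rule can be enforced in the eval harness rather than left to convention. The sketch below is illustrative only: the `MODEL_FAMILY` mapping, the model labels in it, and the `pick_cross_family_judge` helper are all hypothetical names, not identifiers from any provider's API.

```python
# Hypothetical mapping from model label to provider family.
# Labels are placeholders for illustration, not real API model IDs.
MODEL_FAMILY = {
    "gpt-4": "openai",
    "gpt-4o": "openai",
    "claude": "anthropic",
}


def pick_cross_family_judge(candidate_model: str, judge_pool: list[str]) -> str:
    """Return the first judge whose family differs from the candidate's."""
    candidate_family = MODEL_FAMILY[candidate_model]
    for judge in judge_pool:
        if MODEL_FAMILY[judge] != candidate_family:
            return judge
    raise ValueError(
        f"No judge outside the '{candidate_family}' family in {judge_pool}"
    )
```

Failing loudly when no cross-family judge is available is a deliberate choice here: a silent fallback to a same-family judge would reintroduce the bias without anyone noticing.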
Bias 2: Position bias
When an LLM judge is presented with two responses and asked which is better, it shows a significant preference for the response in the first position. Studies have found that the verdict can reverse when the order is swapped: the same response scores higher when it is placed first.
The bias is strong enough that position alone can swing win-rate metrics by 10–15 percentage points in pairwise evaluations. A model that scores as the clear winner in every A/B comparison may simply be the model that appeared first in the judge prompt more often.
Always run pairwise evals in both orders — (A, B) and (B, A) — and report the average. Only declare a winner when the win rate exceeds 50% in both orderings. Any win that reverses on order swap is a position bias artifact, not a quality signal.
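A minimal sketch of the both-orderings rule follows. It assumes a judge callable that takes `(prompt, first_response, second_response)` and returns either `"first"` or `"second"` with no ties. That signature, and the function names, are assumptions made for illustration.

```python
from typing import Callable, Optional

# Assumed judge interface: returns "first" or "second" (no ties).
Judge = Callable[[str, str, str], str]
Item = tuple[str, str, str]  # (prompt, response_a, response_b)


def win_rates_by_order(judge: Judge, items: list[Item]) -> tuple[float, float]:
    """Return A's win rate when shown first and when shown second."""
    if not items:
        raise ValueError("items must not be empty")
    n = len(items)
    # Ordering (A, B): A wins when the judge picks the first slot.
    a_wins_shown_first = sum(judge(p, a, b) == "first" for p, a, b in items)
    # Ordering (B, A): A wins when the judge picks the second slot.
    a_wins_shown_second = sum(judge(p, b, a) == "second" for p, a, b in items)
    return a_wins_shown_first / n, a_wins_shown_second / n


def declare_winner(judge: Judge, items: list[Item]) -> Optional[str]:
    """Declare a winner only if it wins more than 50% in both orderings."""
    shown_first, shown_second = win_rates_by_order(judge, items)
    if shown_first > 0.5 and shown_second > 0.5:
        return "A"
    if shown_first < 0.5 and shown_second < 0.5:
        return "B"
    return None  # result depends on ordering: treat as position bias
```

A judge that always picks the first slot gives A a 100% win rate in one ordering and 0% in the other, so `declare_winner` returns `None` instead of a false winner.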
Bias 3: Length bias
LLM judges consistently rate longer responses as higher quality, independent of content. A response that provides the correct answer in 50 words and one that provides the same correct answer in 200 words will receive different scores — the 200-word version scores higher, even though the additional content adds no factual value.
This matters because it creates an optimization pressure that has nothing to do with quality. If you optimize prompts or fine-tune against an LLM judge, you are implicitly rewarding verbosity. The result is responses that get longer over training iterations without becoming more accurate, more useful, or more honest.
Length bias is the primary driver of the 'response padding' failure mode, where fine-tuned models produce fluent, structured, lengthy responses that score well on automated evals and perform worse with actual users who wanted a short answer. Explicitly instruct your judge to penalize unnecessary length, or score the core answer separately and deduct for padding.
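One way to check whether a judge has this problem is to take a set of responses you already know are equally correct and measure how strongly the judge's scores track word count. The sketch below computes a plain Pearson correlation. The function names are hypothetical, and the diagnostic itself is a suggestion rather than a standard procedure.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def length_score_correlation(responses: list[str], scores: list[float]) -> float:
    """Correlate whitespace-delimited word count with judge score."""
    word_counts = [float(len(r.split())) for r in responses]
    return pearson(word_counts, scores)
```

If the responses really are equivalent in content, a correlation well above zero suggests the judge is rewarding length rather than quality. What counts as "well above zero" depends on sample size and is left to the reader.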
Bias 4: Verbosity and format bias
Closely related to length bias: LLM judges prefer responses with structured formatting — bullet points, headers, numbered lists — over responses that give the same information in dense prose. A response with a markdown header and three bullets scores higher than an equivalent response in two sentences.
This is a problem because formatted responses are not always better responses. In conversational interfaces, heavy formatting looks robotic. In code generation, excessive comments score higher but are often less useful than clean code. In factual Q&A, bullet points fragment information that was more coherent as prose.
Calibration techniques that actually help
- Cross-family judging: always use a different model family than the one generating responses
- Pairwise order randomisation: run all pairwise comparisons in both orders, average the results
- Length-normalised scoring: separately score the core answer and then deduct for unnecessary padding
- Anti-verbosity instructions: explicitly tell the judge in its system prompt to score short correct answers as highly as long correct answers
- Human correlation checks: each month, manually evaluate 50–100 responses the judge scored highly and confirm that human raters agree with the judge on more than 80% of them. If judge and human scores diverge, the judge has a bias you have not calibrated for
- Multi-judge consensus: run the same eval through two different judge models, report only consensus verdicts
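The last two items in the list reduce to simple bookkeeping. The sketch below assumes verdicts are comparable labels such as `"pass"` and `"fail"`. The function names are hypothetical, and the 0.80 threshold is the figure from the list above.

```python
from typing import Optional

HUMAN_AGREEMENT_THRESHOLD = 0.80  # threshold quoted in the list above


def agreement_rate(judge_labels: list[str], human_labels: list[str]) -> float:
    """Fraction of items where the judge and the human gave the same label."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def is_calibrated(judge_labels: list[str], human_labels: list[str]) -> bool:
    """True when human agreement is above the threshold."""
    return agreement_rate(judge_labels, human_labels) > HUMAN_AGREEMENT_THRESHOLD


def consensus_verdicts(
    verdicts_a: list[str], verdicts_b: list[str]
) -> list[Optional[str]]:
    """Keep a verdict only where both judges agree; None marks disagreement."""
    if len(verdicts_a) != len(verdicts_b):
        raise ValueError("both judges must score the same items")
    return [a if a == b else None for a, b in zip(verdicts_a, verdicts_b)]
```

Keeping the disagreements as `None` rather than dropping them preserves item alignment, so the share of non-consensus items can be reported alongside the consensus results.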
When LLM judges are still worth using
None of the above means LLM judges are useless. For regression testing — detecting whether the current model is worse than the previous checkpoint on a fixed test set — LLM judges work well even with biases, because the biases are consistent and the signal is the delta between runs rather than the absolute score.
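For the regression case, the quantity of interest is the per-item change in judge score between two checkpoints on the same fixed test set. A minimal sketch, assuming both score lists are aligned by test item and produced by the same judge with the same prompt (the function name is hypothetical):

```python
from statistics import mean


def mean_paired_delta(prev_scores: list[float], curr_scores: list[float]) -> float:
    """Mean per-item score change from the previous checkpoint to the current one."""
    if not prev_scores or len(prev_scores) != len(curr_scores):
        raise ValueError("need two non-empty, aligned score lists")
    deltas = [curr - prev for prev, curr in zip(prev_scores, curr_scores)]
    return mean(deltas)
```

A negative value means the current checkpoint scored lower on average. Because each item is compared with itself, a bias that affects both runs equally cancels out of the delta, which is the argument the paragraph above makes.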
For prompt iteration — comparing two prompt variants on the same task — pairwise LLM judges with order randomisation are fast and reasonably reliable. The failure mode is using absolute LLM judge scores to make claims about quality or to compare across model families without accounting for systematic bias.
Use LLM judges for relative comparisons (is version N better than N-1?) and human judges for absolute quality gates (is this system good enough to ship?). The two answer different questions and neither replaces the other.