Bias in LLM Outputs: Sources, Types, and What You Can Detect
Training data bias, demographic representation, positional bias in RAG, and confirmation bias in reasoning. How to surface and measure these in your system.
LLMs don't generate bias from nowhere. They learn it from us — from the text we wrote, the decisions we recorded, the stories we told. The uncomfortable truth is that an LLM trained on the internet will reflect the internet: its brilliance and its prejudices, its expertise and its blind spots.
This isn't a reason to not build with LLMs. It's a reason to build with your eyes open — to know the types of bias, where they come from, and what you can actually detect and mitigate versus what requires ongoing human oversight.
Types of bias in LLM outputs
| Type | What it looks like | Example |
|---|---|---|
| Representation bias | Under- or over-representation of groups in training data | Model defaults to male pronouns for 'engineer', female for 'nurse' |
| Stereotype amplification | Model exaggerates group patterns beyond what training data shows | Consistently associates certain ethnicities with crime in creative writing |
| Performance disparity | Model quality degrades for certain languages/dialects/accents | Weaker reasoning in African American Vernacular English vs. Standard American English |
| Allocation bias | Model systematically advantages or disadvantages groups in decisions | Resume screener rates equivalent CVs lower for certain names |
| Sycophancy | Model agrees with the user's apparent beliefs regardless of truth | Changes its assessment of a political claim when told which party the user supports |
| Recency/salience bias | Over-weights recent or frequently-discussed events | Assumes every business is a tech startup if context is ambiguous |
Where bias enters
Training data
The web over-represents English, over-represents wealthy countries, over-represents male voices in certain domains, and contains historical text from periods with explicit discrimination. A model trained on this data learns these patterns as features, not bugs — unless explicit effort is made to counteract them.
RLHF and fine-tuning
Human feedback is not neutral. Annotators have their own cultural backgrounds, language preferences, and implicit assumptions about what a 'good' answer looks like. If the annotator pool is not diverse, RLHF can encode a narrow view of quality. Some alignment research suggests RLHF may amplify sycophancy — the model learns to please, not to be accurate.
Your prompt and context
Priming effects are real. Prompts that mention certain groups, use certain frames, or carry implicit assumptions shift model outputs measurably. An evaluation task described as 'written by a student in a disadvantaged school' generates harsher feedback than the identical essay described neutrally.
What you can detect
# Test if model treats equivalent CVs differently based on perceived demographics
NAMES_SET_A = ["Emily Walsh", "Michael Johnson", "Sarah Chen"]
NAMES_SET_B = ["Lakisha Washington", "Jamal Williams", "María García"]
def audit_bias(resume_template, evaluation_prompt):
results = {}
for name_a, name_b in zip(NAMES_SET_A, NAMES_SET_B):
resume_a = resume_template.replace("{NAME}", name_a)
resume_b = resume_template.replace("{NAME}", name_b)
score_a = llm(evaluation_prompt + resume_a)
score_b = llm(evaluation_prompt + resume_b)
results[f"{name_a} vs {name_b}"] = {
"score_a": extract_score(score_a),
"score_b": extract_score(score_b),
"delta": extract_score(score_a) - extract_score(score_b)
}
return results
The 'Are Emily and Lakisha scored the same?' test is not a comprehensive bias audit. It catches one dimension of one type of bias. Real bias auditing is multi-dimensional, ongoing, and requires domain expertise. A passing pairwise test does not mean your system is unbiased.
Mitigations that actually work
- Explicit fairness instructions in your system prompt: 'Evaluate candidates solely on their stated qualifications, disregarding names, schools, or any demographic indicators'
- Output filtering: screen model outputs for slurs, stereotypes, and discriminatory content before returning to users
- Diverse annotator pools for any fine-tuning or RLHF — explicitly recruit for demographic, cultural, and linguistic diversity
- Regular bias audits: the pairwise substitution test is a starting point; run it monthly on your production system
- Human review for high-stakes decisions: never let an LLM make final employment, credit, or healthcare decisions without human oversight
- Performance audits across user segments: if your product serves diverse users, check whether quality metrics differ by segment
Run a bias audit →: Test your prompts for systematic disparities in the Playground module.
Try it interactively
GenAI Systems Lab is a free platform for AI engineers — configure real failure modes, break things, and build the judgment that gets you hired.
Open GenAI Systems Lab →