Inter-Annotator Agreement: Cohen's Kappa, Krippendorff's Alpha, and Why Low IAA Is a Model Problem

If annotators disagree 30% of the time, your model's ceiling is below 70%. Cohen's Kappa (chance-corrected agreement for two annotators), Krippendorff's Alpha (multi-annotator, ordinal data), and how to design an annotation pipeline that produces reliable labels.

Why Annotation Quality Is a Model Quality Problem

Garbage in, garbage out is a cliché. The more precise version: if your annotators disagree with each other 30% of the time, a model trained and scored on those labels cannot be expected to measure reliably better than about 70% on that task, regardless of architecture, data volume, or training time. The noise floor in your labels becomes a ceiling on measurable model performance.

Inter-annotator agreement (IAA) measures how consistently multiple annotators produce the same label for the same item. Low IAA tells you one of three things: the task definition is ambiguous, the annotators are insufficiently trained, or the task is genuinely hard and requires richer label structures.

Cohen's Kappa: Agreement Beyond Chance

Raw agreement (the proportion of items on which annotators agree) is misleading. Two annotators who label at random, each assigning the majority class 90% of the time, will agree 82% of the time by chance alone (0.9 × 0.9 + 0.1 × 0.1 = 0.82). Cohen's Kappa corrects for chance agreement.
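The chance-agreement arithmetic can be checked with a few lines of Python. This is a minimal sketch: `chance_agreement` is a hypothetical helper name, and it assumes each annotator labels independently according to fixed class proportions.

```python
def chance_agreement(marginals_a: dict, marginals_b: dict) -> float:
    """Probability that two independent annotators pick the same class.

    Each argument maps class name -> proportion of items the annotator
    assigns to that class (proportions should sum to 1).
    """
    return sum(p * marginals_b.get(c, 0.0) for c, p in marginals_a.items())

# Both annotators assign the majority class 90% of the time, at random.
skewed = {"majority": 0.9, "minority": 0.1}
print(f"Chance agreement (90/10 split): {chance_agreement(skewed, skewed):.2f}")

# With balanced classes the chance level drops to 0.50.
balanced = {"majority": 0.5, "minority": 0.5}
print(f"Chance agreement (50/50 split): {chance_agreement(balanced, balanced):.2f}")
```

The more skewed the class distribution, the higher the agreement you get for free, which is why raw agreement on imbalanced tasks says little on its own.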

from collections import Counter

def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """
    Cohen's Kappa: chance-corrected agreement.
    κ = (P_o - P_e) / (1 - P_e)
    P_o = observed agreement
    P_e = expected agreement by chance
    
    Interpretation:
      κ < 0:    worse than chance
      0–0.2:    slight agreement
      0.2–0.4:  fair agreement
      0.4–0.6:  moderate agreement
      0.6–0.8:  substantial agreement
      0.8–1.0:  almost perfect agreement
    """
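    # Both annotators must have labeled the same, non-empty set of items
    if len(labels_a) != len(labels_b):
        raise ValueError("labels_a and labels_b must have the same length")
    n = len(labels_a)
    if n == 0:
        raise ValueError("need at least one labeled item")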
    
    # Observed agreement
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    
    # Expected agreement by chance
    classes = set(labels_a) | set(labels_b)
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in classes)
    
    if p_expected == 1.0:
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)

# Example: sentiment labeling task
a = ["pos", "neg", "pos", "neg", "pos", "pos", "neg", "pos"]
b = ["pos", "neg", "neg", "neg", "pos", "pos", "neg", "neg"]
print(f"Raw agreement: {sum(x==y for x,y in zip(a,b))/len(a):.2f}")
print(f"Cohen's Kappa: {cohens_kappa(a, b):.4f}")
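Working the sentiment example by hand shows where the number comes from. The two annotators agree on 6 of the 8 items; annotator a assigns "pos" five times and "neg" three times, while annotator b assigns "pos" three times and "neg" five times. The figures below are rounded to three decimals.

```latex
\begin{align*}
P_o &= \frac{6}{8} = 0.75 \\
P_e &= \frac{5}{8}\cdot\frac{3}{8} + \frac{3}{8}\cdot\frac{5}{8} = \frac{30}{64} \approx 0.469 \\
\kappa &= \frac{P_o - P_e}{1 - P_e} = \frac{0.75 - 0.469}{1 - 0.469} \approx 0.529
\end{align*}
```

A raw agreement of 0.75 therefore shrinks to a Kappa of about 0.53, which falls in the "moderate agreement" band of the interpretation table.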

Krippendorff's Alpha: Multi-Annotator and Ordinal Data

Cohen's Kappa is designed for exactly two annotators with nominal labels. Krippendorff's Alpha generalises to any number of annotators, to nominal, ordinal, interval, and ratio scales, and to items with missing annotations. It is widely used in NLP annotation studies.

import numpy as np
import krippendorff  # pip install krippendorff

def krippendorffs_alpha(annotations: list[list], metric: str = "nominal") -> float:
    """
    annotations: outer list = annotators, inner list = their numeric labels per item.
    Use None for missing values.
    metric: 'nominal' | 'ordinal' | 'interval' | 'ratio'
    """
    # The library documents np.nan as its missing-value marker, so convert
    # None to np.nan before delegating the D_o / D_e computation to it.
    data = np.array(
        [[np.nan if v is None else v for v in row] for row in annotations],
        dtype=float,
    )
    return krippendorff.alpha(reliability_data=data, level_of_measurement=metric)

# Example: three annotators giving ordinal ratings to eight items
data = [
    [1, 2, 3, 3, 2, 1, 4, 1],  # annotator 1
    [1, 2, 3, 3, 2, 2, 4, 1],  # annotator 2
    [None, 2, 3, 4, 2, 1, 4, 1],  # annotator 3 (missing first item)
]
alpha = krippendorffs_alpha(data, metric="ordinal")
print(f"Krippendorff Alpha = {alpha:.4f}")
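For intuition about what the D_o and D_e terms involve, the nominal case can be sketched in pure Python. This is an illustrative sketch, not the library's implementation: `nominal_alpha` is a hypothetical name, it handles only the nominal metric, and it assumes None marks a missing annotation.

```python
from collections import Counter
from itertools import permutations

def nominal_alpha(annotations: list[list]) -> float:
    """Krippendorff's Alpha for nominal labels: alpha = 1 - D_o / D_e.

    annotations: outer list = annotators, inner list = labels per item,
    with None for missing values.
    """
    n_items = len(annotations[0])
    coincidences = Counter()  # (label, label) -> weighted pair count
    for i in range(n_items):
        values = [row[i] for row in annotations if row[i] is not None]
        m = len(values)
        if m < 2:
            continue  # an item with fewer than two labels cannot be paired
        # Every ordered pair of labels from different annotators counts 1/(m-1)
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()  # label -> number of pairable values with that label
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n < 2:
        raise ValueError("need at least one item labeled by two annotators")

    # Numerators of D_o and D_e; the shared 1/n factor cancels in the ratio
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return 1.0  # only one label was ever used, so no disagreement is possible
    return 1 - observed / expected

# Two annotators who agree on two of four binary items
print(f"{nominal_alpha([[0, 1, 0, 1], [0, 1, 1, 0]]):.3f}")
```

Ordinal and interval metrics replace the simple "labels differ" test with a distance between labels, which is the part most easily left to the library.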

Annotation Pipeline Design

Write explicit annotation guidelines before collecting any labels. Ambiguous guidelines produce low IAA. Test your guidelines: pick 50 items, have two experienced team members label them, measure Kappa. A Kappa < 0.6 means iterate on guidelines before scaling.
Always collect redundant annotations for a sample. For a 10,000-item dataset, have at least 1,000 items labeled by 3+ annotators to measure IAA.
Use majority vote or adjudication for final labels. Don't just take annotator 1's label. For high-stakes tasks, send items where three or more annotators each chose a different label to expert review.
Track annotator drift over time. An annotator who produces Kappa 0.85 in week 1 might drift to 0.60 in week 8. Monitor per-annotator agreement with a gold standard set interspersed in batches.
Label difficulty is signal. Items with low agreement are genuinely hard. These are often the most informative training examples and the ones your model will struggle most with. Give them extra attention.
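Two of the steps above, majority voting with an adjudication queue and drift monitoring against a gold set, can be sketched as follows. The function names, the input shapes, and the rule that any tie for the top label goes to review are assumptions made for illustration, not a prescribed pipeline.

```python
from collections import Counter

def aggregate_labels(item_annotations: dict) -> tuple[dict, list]:
    """Majority vote per item; ties (including total disagreement) go to review.

    item_annotations: item id -> list of labels from different annotators.
    Returns (final_labels, needs_review).
    """
    final_labels, needs_review = {}, []
    for item_id, labels in item_annotations.items():
        if not labels:
            needs_review.append(item_id)
            continue
        counts = Counter(labels).most_common()
        top_label, top_count = counts[0]
        tied = len(counts) > 1 and counts[1][1] == top_count
        if tied:
            needs_review.append(item_id)  # no clear majority: adjudicate
        else:
            final_labels[item_id] = top_label
    return final_labels, needs_review

def gold_agreement_by_batch(records) -> dict:
    """Share of gold items each annotator labeled correctly, per batch.

    records: iterable of (annotator, batch, label, gold_label) tuples.
    """
    hits, totals = Counter(), Counter()
    for annotator, batch, label, gold_label in records:
        totals[(annotator, batch)] += 1
        hits[(annotator, batch)] += int(label == gold_label)
    return {key: hits[key] / totals[key] for key in totals}

votes = {
    "item-1": ["pos", "pos", "neg"],
    "item-2": ["pos", "neg", "neu"],  # all three disagree
    "item-3": ["neg", "neg", "neg"],
}
final_labels, needs_review = aggregate_labels(votes)
print(final_labels)   # items with a clear majority
print(needs_review)   # items to send to an expert
```

Plotting the output of `gold_agreement_by_batch` over time for each annotator is one simple way to spot the kind of drift described above.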

If Cohen's Kappa on your task is below 0.4 after two rounds of guideline iteration, you should question whether the label scheme is appropriate. It may be that a softer label (probability distribution across classes) or a richer annotation structure (span + class + intensity) will produce more consistent annotations than forcing a binary choice.
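One simple way to build a softer label is to turn each item's annotator votes into a probability distribution over classes. The sketch below assumes equal weight per annotator; `soft_label` is a hypothetical helper name.

```python
from collections import Counter

def soft_label(labels: list, classes: list) -> dict:
    """Convert one item's annotator votes into a distribution over classes."""
    if not labels:
        raise ValueError("need at least one annotation for the item")
    counts = Counter(labels)
    return {c: counts[c] / len(labels) for c in classes}

# Three annotators, two of whom chose "pos"
dist = soft_label(["pos", "pos", "neg"], classes=["pos", "neg", "neu"])
print(dist)
```

A distribution like this keeps the disagreement as information instead of discarding it, and it can serve as a training target for losses that accept probabilities rather than hard class indices.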
