
NDCG and MRR From Scratch: The Ranking Metrics Every AI Engineer Needs

Why accuracy is the wrong metric for search and recommendation. MRR for navigational queries (first correct result wins), DCG for graded relevance, NDCG for cross-query comparison. Full Python implementation with the math explained.

Why Accuracy Is the Wrong Metric for Search and Recommendation

A classifier is right or wrong. A ranker returns a list, and the list's quality depends on what appears at the top — not just whether the right items appear at all. A ranking where the correct result appears at position 10 is objectively worse than one where it appears at position 1, even if both have 100% recall@10.
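To make that concrete, here is a minimal sketch of two rankings that recall@10 cannot tell apart. The helper `recall_at_k` and the document IDs are illustrative assumptions, not part of any library.

```python
def recall_at_k(ranked_items: list, relevant_items: set, k: int) -> float:
    """Fraction of the relevant items that appear in the top k results."""
    if not relevant_items:
        return 0.0
    hits = sum(1 for item in ranked_items[:k] if item in relevant_items)
    return hits / len(relevant_items)

relevant = {"doc_a"}
fillers = [f"doc_{i}" for i in range(9)]   # nine non-relevant documents

ranking_top = ["doc_a"] + fillers          # correct result at position 1
ranking_bottom = fillers + ["doc_a"]       # correct result at position 10

recall_top = recall_at_k(ranking_top, relevant, k=10)
recall_bottom = recall_at_k(ranking_bottom, relevant, k=10)
position_top = ranking_top.index("doc_a") + 1
position_bottom = ranking_bottom.index("doc_a") + 1

print(recall_top, recall_bottom)       # 1.0 1.0  (recall sees no difference)
print(position_top, position_bottom)   # 1 10     (the user certainly does)
```

A ranking-aware metric has to separate these two lists, which a set-based metric like recall cannot do.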

Two metrics dominate search, recommendation, and retrieval evaluation: NDCG (Normalized Discounted Cumulative Gain) and MRR (Mean Reciprocal Rank). Both are ranking-aware. Both penalize relevant results buried deep in the list.

MRR: Mean Reciprocal Rank

MRR is the average, across all queries, of the reciprocal of the rank at which the first relevant result appears. If for query 1 the first relevant result is at rank 3, the reciprocal rank is 1/3. If for query 2 it's at rank 1, the reciprocal rank is 1. If a query returns no relevant result at all, its reciprocal rank is 0. MRR is the mean of these values over all queries.

def reciprocal_rank(ranked_items: list, relevant_items: set) -> float:
    """
    Returns the reciprocal rank of the first relevant item.
    1-indexed: first position has rank 1, not 0.
    """
    for rank, item in enumerate(ranked_items, start=1):
        if item in relevant_items:
            return 1.0 / rank
    return 0.0   # no relevant item found in the list

def mrr(queries: list[tuple[list, set]]) -> float:
    """queries: list of (ranked_items, relevant_items) tuples"""
    if not queries:
        return 0.0   # empty evaluation set: avoid ZeroDivisionError
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in queries) / len(queries)

# Example
ranked_results = [
    (["doc_c", "doc_a", "doc_b"], {"doc_a", "doc_b"}),   # first relevant at rank 2 → RR = 0.5
    (["doc_x", "doc_y", "doc_a"], {"doc_a"}),             # first relevant at rank 3 → RR = 0.333
    (["doc_a", "doc_b", "doc_c"], {"doc_a", "doc_c"}),    # first relevant at rank 1 → RR = 1.0
]
print(f"MRR = {mrr(ranked_results):.4f}")   # (0.5 + 0.333 + 1.0) / 3 = 0.611

MRR is best for navigational queries where users want exactly one result — finding a specific page, a known artist, a product by name. It completely ignores what happens after the first relevant result, so it's a poor fit for exploratory queries where breadth matters.
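The blind spot is easy to demonstrate. The sketch below uses a standalone helper, `first_relevant_rr` (a hypothetical name that mirrors the reciprocal-rank logic described above), and made-up document IDs to compare two rankings that differ only after the first hit.

```python
def first_relevant_rr(ranked_items: list, relevant_items: set) -> float:
    """Reciprocal rank of the first relevant item (1-indexed), 0.0 if none."""
    for rank, item in enumerate(ranked_items, start=1):
        if item in relevant_items:
            return 1.0 / rank
    return 0.0

relevant = {"doc_a", "doc_b", "doc_c"}

ranking_rich = ["doc_a", "doc_b", "doc_c", "doc_x"]   # three relevant results
ranking_thin = ["doc_a", "doc_x", "doc_y", "doc_z"]   # one relevant result

rr_rich = first_relevant_rr(ranking_rich, relevant)
rr_thin = first_relevant_rr(ranking_thin, relevant)
hits_rich = sum(1 for item in ranking_rich if item in relevant)
hits_thin = sum(1 for item in ranking_thin if item in relevant)

print(rr_rich, rr_thin)       # 1.0 1.0  (identical reciprocal rank)
print(hits_rich, hits_thin)   # 3 1      (very different result pages)
```

Both rankings score a perfect reciprocal rank, even though one of them surfaces three times as many relevant documents.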

DCG and NDCG: Graded Relevance

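MRR treats relevance as binary. DCG (Discounted Cumulative Gain) handles graded relevance: a rating of 3 ('very relevant') is worth more than a rating of 1 ('somewhat relevant'). It also rewards multiple relevant results, not just the first, and it discounts each result's gain by the logarithm of its rank, so the same document contributes less the further down it appears. NDCG divides DCG by the DCG of the ideal ordering (IDCG), which puts every query on the same 0 to 1 scale and makes scores comparable across queries.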
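Written out, the standard linear-gain form looks like this, where rel_i is the graded relevance of the item at rank i. The two-item ranking in the worked lines is an illustrative assumption chosen to keep the arithmetic short.

```latex
\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i + 1)},
\qquad
\mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}

% Worked example: system ranking with relevances [1, 3], ideal order [3, 1]
\mathrm{DCG@}2  = \frac{1}{\log_2 2} + \frac{3}{\log_2 3} \approx 1 + 1.893 = 2.893
\mathrm{IDCG@}2 = \frac{3}{\log_2 2} + \frac{1}{\log_2 3} \approx 3 + 0.631 = 3.631
\mathrm{NDCG@}2 \approx \frac{2.893}{3.631} \approx 0.797
```

Swapping the two items is enough to lose about 20% of the achievable score, because the grade-3 item is divided by log2(3) instead of log2(2).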

import numpy as np

def dcg_at_k(relevances: list[float], k: int) -> float:
    """
    Discounted Cumulative Gain at K.
    relevances: graded relevance scores for ranked items (index 0 = rank 1)
    Standard formula: sum( rel_i / log2(i + 1) ) for i in 1..k
    """
    relevances = np.array(relevances[:k], dtype=float)
    if len(relevances) == 0:
        return 0.0
    discounts = np.log2(np.arange(2, len(relevances) + 2))   # log2(2), log2(3), ..., log2(k+1)
    return np.sum(relevances / discounts)

def ndcg_at_k(relevances: list[float], k: int) -> float:
    """
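    Normalized DCG at K.
    IDCG = DCG of the ideal (perfectly sorted) ranking of the same relevances.
    NDCG = DCG / IDCG  → range [0, 1]
    Note: the ideal ordering is built only from the scores passed in, so
    relevant items the system never retrieved do not lower the score.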
    """
    dcg  = dcg_at_k(relevances, k)
    ideal_relevances = sorted(relevances, reverse=True)
    idcg = dcg_at_k(ideal_relevances, k)
    return dcg / idcg if idcg > 0 else 0.0

# Example: 4 retrieved items with graded relevance [3, 1, 0, 2]
# Ideal order would be [3, 2, 1, 0]
relevances = [3, 1, 0, 2]
print(f"DCG@4  = {dcg_at_k(relevances, 4):.4f}")
print(f"NDCG@4 = {ndcg_at_k(relevances, 4):.4f}")   # 0.9434 (not perfect: the grade-2 item is buried at rank 4)

# Batch NDCG over multiple queries
def mean_ndcg(query_results: list[tuple[list, list]], k: int) -> float:
    """query_results: list of (system_ranking_ids, [(id, relevance)] judgments)"""
    scores = []
    for ranked_ids, judgments in query_results:
        judgment_map = dict(judgments)
        # Unjudged documents are treated as non-relevant (relevance 0)
        relevances_in_rank_order = [judgment_map.get(id_, 0) for id_ in ranked_ids[:k]]
        # The ideal ranking comes from all judged documents, retrieved or not
        ideal_relevances = sorted(judgment_map.values(), reverse=True)
        idcg = dcg_at_k(ideal_relevances, k)
        scores.append(dcg_at_k(relevances_in_rank_order, k) / idcg if idcg > 0 else 0.0)
    return float(np.mean(scores)) if scores else 0.0
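One detail worth checking in any batch NDCG is where the ideal ranking comes from. The sketch below is a standalone illustration with made-up judgments and a local `dcg` helper (both assumptions for this example). It compares normalizing against only the retrieved items with normalizing against every judged document for the query.

```python
import math

def dcg(relevances: list, k: int) -> float:
    """Linear-gain DCG: sum of rel_i / log2(i + 1) over the top k ranks."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

# Hypothetical judgments for one query, and a system that misses the best documents
judgments = {"doc_a": 3, "doc_b": 2, "doc_c": 1}
ranked_ids = ["doc_c", "doc_x", "doc_y"]
k = 3

retrieved = [judgments.get(doc_id, 0) for doc_id in ranked_ids[:k]]   # [1, 0, 0]
system_dcg = dcg(retrieved, k)

# Ideal ranking built from the retrieved items only
idcg_retrieved_only = dcg(sorted(retrieved, reverse=True), k)
# Ideal ranking built from every judged document for the query
idcg_all_judged = dcg(sorted(judgments.values(), reverse=True), k)

ndcg_retrieved_only = system_dcg / idcg_retrieved_only
ndcg_all_judged = system_dcg / idcg_all_judged

print(f"{ndcg_retrieved_only:.4f}")   # 1.0000 (looks perfect)
print(f"{ndcg_all_judged:.4f}")       # 0.2100 (reflects the missed documents)
```

When full judgments are available, normalizing against all of them is usually the safer choice, since a system should not score 1.0 for perfectly ordering a poor candidate set.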

Which Metric to Use

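NDCG@10 is the de facto standard for web search evaluation, where judgments are graded and several results on the page matter. MRR@10 is preferred for QA and entity retrieval, where one correct answer near the top is enough. Precision@K, the fraction of the top K results that are relevant, is still used in IR research. In recommendation, NDCG@20 or NDCG@50 is common because users scroll further down the list.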
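The cutoff variants mentioned here are small changes to the functions already covered. The sketch below shows one plausible way to write them. The function names and sample data are assumptions for illustration, and this Precision@K divides by K even when fewer than K results are returned, which is one convention among several.

```python
def precision_at_k(ranked_items: list, relevant_items: set, k: int) -> float:
    """Fraction of the top k positions filled by relevant items (divides by k)."""
    if k <= 0:
        return 0.0
    hits = sum(1 for item in ranked_items[:k] if item in relevant_items)
    return hits / k

def reciprocal_rank_at_k(ranked_items: list, relevant_items: set, k: int) -> float:
    """Reciprocal rank of the first relevant item within the top k, else 0.0."""
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item in relevant_items:
            return 1.0 / rank
    return 0.0

ranked = ["doc_x", "doc_a", "doc_y", "doc_b"]
relevant = {"doc_a", "doc_b"}

p_at_2 = precision_at_k(ranked, relevant, k=2)          # 1 hit in top 2
p_at_4 = precision_at_k(ranked, relevant, k=4)          # 2 hits in top 4
rr_at_1 = reciprocal_rank_at_k(ranked, relevant, k=1)   # nothing relevant at rank 1
rr_at_4 = reciprocal_rank_at_k(ranked, relevant, k=4)   # first hit at rank 2

print(p_at_2, p_at_4)     # 0.5 0.5
print(rr_at_1, rr_at_4)   # 0.0 0.5
```

Averaging `reciprocal_rank_at_k` over all queries gives MRR@K. Any query whose first relevant result falls below the cutoff contributes 0.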
