Learning to Rank: Pointwise, Pairwise, and Why LambdaMART Won

Ranking as regression vs. pair ordering vs. list optimisation. RankNet pairwise loss, NDCG@K from scratch, and LambdaMART as the production standard. What features go in a real LTR model and how to iterate on them.

BM25 and dense retrieval produce scores, and those scores rank documents. But the resulting order is not necessarily the one a human evaluator would produce: BM25 rewards term statistics and dense retrieval rewards embedding similarity, and neither directly measures relevance to the user's actual intent. Learning to Rank (LTR) trains a model to produce the ranking directly, using features from the query, the document, and their interaction. It is a standard production technique at large search engines.

Three problem formulations

Pointwise: treat ranking as regression or classification over individual documents. Train a model to predict a relevance score for each (query, document) pair, then rank by predicted score at query time. Advantage: simple. Disadvantage: it optimises absolute scores, not ordering, so a model can have a good MAE and still produce a terrible ranking.

Pairwise: train the model to order pairs of documents correctly: given (query, doc_A, doc_B), which should rank higher? RankNet uses a pairwise objective, and LambdaRank builds on it by weighting each pair by its effect on the ranking metric.

Listwise: optimise a ranking metric (NDCG, MAP), or a loss defined over the full ranked list. ListNet uses a listwise loss, and LambdaMART is usually grouped here because its gradients are weighted by changes in NDCG. This family tends to give the best quality and is the most complex.
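The pointwise weakness can be seen on a toy case. The sketch below uses made-up labels and predictions (`preds_a` and `preds_b` are hypothetical, not from any real model) to show that a lower MAE does not imply a better ordering.

```python
import numpy as np

labels = np.array([3.0, 2.0, 1.0, 0.0])   # graded relevance for four documents

# Hypothetical predictions from two pointwise models
preds_a = np.array([1.4, 1.5, 1.6, 1.7])  # closer to the labels on average, order reversed
preds_b = np.array([9.0, 8.0, 7.0, 6.0])  # far from the labels, order correct

def mae(y_true, y_pred):
    """Mean absolute error between labels and predicted scores."""
    return float(np.mean(np.abs(y_true - y_pred)))

def pairwise_accuracy(y_true, y_pred):
    """Fraction of pairs with different labels that the scores order correctly."""
    correct, total = 0, 0
    for i in range(len(y_true)):
        for j in range(i + 1, len(y_true)):
            if y_true[i] == y_true[j]:
                continue                   # ties say nothing about order
            total += 1
            if np.sign(y_true[i] - y_true[j]) == np.sign(y_pred[i] - y_pred[j]):
                correct += 1
    return correct / total

for name, preds in [("A", preds_a), ("B", preds_b)]:
    print(f"Model {name}: MAE={mae(labels, preds):.2f}  "
          f"pairwise accuracy={pairwise_accuracy(labels, preds):.2f}")
```

Model A wins on MAE (1.10 against 6.00) yet orders every pair wrongly, while Model B orders every pair correctly. A ranking loss only cares about the second property.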

import numpy as np

# ─── Pairwise loss (RankNet) ──────────────────────────────────────────────────
def ranknet_loss(score_i, score_j, label_ij):
    """
    label_ij = 1  if document i should rank higher than j
    label_ij = 0  if equal
    label_ij = -1 if document j should rank higher
    P_ij = sigmoid(s_i - s_j) = probability model says i > j
    """
    def sigmoid(x): return 1 / (1 + np.exp(-np.clip(x, -10, 10)))
    p_ij = sigmoid(score_i - score_j)
    t = (label_ij + 1) / 2                 # convert {-1,0,1} to {0, 0.5, 1}
    # Cross-entropy
    loss = -(t * np.log(p_ij + 1e-9) + (1-t) * np.log(1 - p_ij + 1e-9))
    return loss

# ─── NDCG@K implementation ────────────────────────────────────────────────────
def dcg_at_k(relevances, k):
    """Discounted Cumulative Gain at K."""
    r = np.array(relevances[:k], dtype=float)
    if r.size == 0:
        return 0.0
    # Gain / log2(rank+1). Rank starts at 1 → log2(2), log2(3), ...
    discounts = 1.0 / np.log2(np.arange(2, r.size + 2))
    return (r * discounts).sum()

def ndcg_at_k(relevances, k):
    """Normalised DCG: DCG / ideal DCG."""
    ideal = sorted(relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(relevances, k) / ideal_dcg

# ── Demo: compare two rankings ────────────────────────────────────────────────
# Relevance labels for 5 documents (0=irrelevant, 1=relevant, 2=highly relevant)
true_labels = [2, 1, 0, 2, 1]  # in original order

# Ranking A: ideal order (both 2s first, then the 1s, then the 0)
ranking_A = [0, 3, 1, 4, 2]    # document indices sorted by score_A, best first
# Ranking B: worst order (irrelevant document first, both 2s last)
ranking_B = [2, 4, 1, 0, 3]

relevances_A = [true_labels[i] for i in ranking_A]
relevances_B = [true_labels[i] for i in ranking_B]

for k in [1, 3, 5]:
    print(f"NDCG@{k}  |  Ranking A: {ndcg_at_k(relevances_A, k):.3f}  |  Ranking B: {ndcg_at_k(relevances_B, k):.3f}")

# ── Pairwise loss on the same data ────────────────────────────────────────────
print("\nRankNet pairwise losses for Ranking B (wrong pairs should have high loss):")
scores_B = [0.3, 0.5, 0.8, 0.1, 0.6]    # model scores for docs 0-4; sorting by score gives ranking_B
true_labels = [2, 1, 0, 2, 1]
for i in range(len(scores_B)):
    for j in range(i+1, len(scores_B)):
        label = np.sign(true_labels[i] - true_labels[j])
        if label == 0:
            continue                       # tied labels carry no ordering signal
        loss  = ranknet_loss(scores_B[i], scores_B[j], label)
        if loss > np.log(2):               # loss above ln 2 means the pair is mis-ordered
            print(f"  doc{i}(rel={true_labels[i]}, score={scores_B[i]}) vs doc{j}(rel={true_labels[j]}, score={scores_B[j]}): loss={loss:.3f}")

LambdaMART: the production standard

LambdaMART (Burges, 2010) is one of the most widely deployed LTR algorithms. It trains a gradient-boosted tree ensemble in which each tree is fitted to 'lambda gradients': pairwise gradients rescaled so that the pairs whose swap would change NDCG the most get the largest weight. The key insight: NDCG depends on the scores only through the sort order, so it is piecewise constant and its gradient is zero or undefined everywhere, which rules out direct gradient descent. Instead, each RankNet-style pairwise gradient is weighted by the absolute change in NDCG that swapping the two documents would cause. Both XGBoost and LightGBM ship LambdaMART-style objectives: XGBoost exposes rank:pairwise and rank:ndcg, and LightGBM exposes lambdarank.
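To make the lambda idea concrete, here is a minimal NumPy sketch of the per-document lambdas for a single query. It is an illustration under stated assumptions, not a library implementation: the function name `lambda_gradients` is made up, the gain is linear to match this article's DCG (Burges uses 2^rel - 1), and the sign is chosen so that a positive value means "push this score up".

```python
import numpy as np

def lambda_gradients(scores, labels, sigma=1.0):
    """Per-document lambdas for one query (positive = score should increase)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    n = len(scores)

    # 1-based rank of each document under the current scores
    order = np.argsort(-scores, kind="stable")
    ranks = np.empty(n, dtype=int)
    ranks[order] = np.arange(1, n + 1)
    discounts = 1.0 / np.log2(ranks + 1)

    # Ideal DCG with linear gain, used to normalise the swap deltas
    ideal_dcg = np.sum(np.sort(labels)[::-1] / np.log2(np.arange(2, n + 2)))
    lambdas = np.zeros(n)
    if ideal_dcg == 0:
        return lambdas

    for i in range(n):
        for j in range(n):
            if labels[i] <= labels[j]:
                continue                   # only pairs where i is more relevant than j
            # |change in NDCG| if documents i and j swapped positions
            delta_ndcg = abs((labels[i] - labels[j]) *
                             (discounts[i] - discounts[j])) / ideal_dcg
            # RankNet gradient magnitude: large when the pair is mis-ordered
            rho = 1.0 / (1.0 + np.exp(sigma * (scores[i] - scores[j])))
            lambdas[i] += sigma * rho * delta_ndcg
            lambdas[j] -= sigma * rho * delta_ndcg
    return lambdas

example_scores = [0.3, 0.5, 0.8, 0.1, 0.6]   # illustrative scores for docs 0-4
example_labels = [2, 1, 0, 2, 1]
for doc, lam in enumerate(lambda_gradients(example_scores, example_labels)):
    print(f"doc{doc}: rel={example_labels[doc]}  lambda={lam:+.4f}")
```

In LambdaMART each new tree is fitted to values like these, so the boosting step moves scores in the direction that most improves NDCG. The lambdas sum to zero because every pair contributes equal and opposite pushes.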

Features in a production LTR model

Document features: document freshness, document click-through rate, page authority.

Query features: query length, query type (navigational/informational), user intent classification.

Query-document interaction features: BM25 score, dense retrieval score, BM25F per field, exact match count, URL keyword match, anchor text match, and a cross-encoder score if available.

LTR models in production typically use 50-200 features. Gradient-boosted trees learn non-linear combinations of features on their own, so most of the feature work is deciding what to include.
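Tree-based rankers generally expect a feature matrix, a label vector, and a list of group sizes with each query's rows stored contiguously. The sketch below shows one way to assemble those; the `Candidate` fields and the `to_ltr_arrays` helper are hypothetical names chosen for illustration, with a handful of the features listed above.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Candidate:
    """One (query, document) row; field names are illustrative."""
    query_id: str
    bm25: float
    dense_score: float
    exact_match_count: int
    query_length: int
    doc_age_days: float
    label: int                      # graded relevance judgment

FEATURES = ["bm25", "dense_score", "exact_match_count", "query_length", "doc_age_days"]

def to_ltr_arrays(candidates):
    """Return (X, y, group) with each query's rows stored contiguously."""
    rows = sorted(candidates, key=lambda c: c.query_id)
    X = np.array([[getattr(c, f) for f in FEATURES] for c in rows], dtype=float)
    y = np.array([c.label for c in rows], dtype=int)
    # np.unique returns sorted ids, matching the sorted row order above
    _, counts = np.unique([c.query_id for c in rows], return_counts=True)
    return X, y, counts.tolist()

sample = [
    Candidate("q2", 7.1, 0.62, 1, 3, 12.0, 1),
    Candidate("q1", 9.4, 0.81, 2, 2, 400.0, 2),
    Candidate("q2", 3.3, 0.40, 0, 3, 2.0, 0),
    Candidate("q1", 1.2, 0.35, 0, 2, 30.0, 0),
    Candidate("q2", 8.8, 0.77, 2, 3, 90.0, 2),
]
X, y, group = to_ltr_arrays(sample)
print(X.shape, y.tolist(), group)
```

Keeping the feature list in one place makes it straightforward to add or drop a feature group later without touching the rest of the pipeline.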

Start with a baseline: train a LightGBM ranker on (BM25 score, dense score, query length, document length) with the lambdarank objective and NDCG@10 as the evaluation metric. Measure offline NDCG@10 on held-out queries. Then add one feature group at a time (freshness, click signals, exact match) and track the NDCG@10 improvement per feature group. This is how real search teams iterate on ranking: not by tuning BM25 parameters, but by building the feature set that captures the dimensions of relevance their users care about.

From RankNet to LambdaRank to LambdaMART: An Overview — Burges (2010)
