
Hard Negatives: The Training Trick That Actually Improves Retrieval

Hard negatives — semantically close but intent-misaligned pairs — are the highest-ROI training signal for retrieval models. Three mining strategies with code.

If you've trained or fine-tuned a retrieval model and it still underperforms on your dataset, the first thing to check is your negatives. Specifically: they are probably too easy. A model that has never had to distinguish between documents that look similar but mean different things gets no training signal for making that distinction. Hard negatives supply that signal.

This post covers what hard negatives are, why they're the highest-leverage improvement you can make to retrieval training, and three concrete mining strategies you can implement today.

Easy negatives vs. hard negatives

A training example for a retrieval model consists of a query, a positive document (relevant), and one or more negative documents (not relevant). The quality of your negatives determines how much the model learns from each example.

| Negative type | Example | What the model learns |
| --- | --- | --- |
| Random/easy | Query: 'return policy' → Negative: a recipe for chocolate cake | Almost nothing: the difference is obvious without learning |
| Semi-hard | Query: 'return policy' → Negative: a shipping policy document | Some discriminative features: different topic, similar domain |
| Hard negative | Query: 'how do I return a damaged item?' → Negative: 'return policy for unopened items' | Fine-grained intent distinction: same topic, different user need |

Hard negatives are semantically close to the query but intent-misaligned. They share surface-level vocabulary and topic with the positive, but don't actually answer the user's question. Training on these forces the model to encode intent, not just topic similarity.

Why hard negatives matter for bi-encoders

Bi-encoders (the architecture used by most sentence-transformers models and, from the caller's perspective, by embedding APIs such as OpenAI's text-embedding-3 and Cohere's embed models) encode queries and documents independently into dense vectors. Similarity is then computed by dot product or cosine similarity at query time.

The fundamental problem: a bi-encoder trained only on easy negatives learns to separate topics, not intents. It will correctly retrieve documents about 'return policies' for a query about returns — but it won't correctly rank 'return damaged items' above 'return unopened items' for a user who needs the former. The vectors for these two documents are nearly identical in topic space but diverge in intent space. Hard negative training is what teaches the model to encode that intent distinction.
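To make the scoring step concrete, here is a minimal sketch of how a bi-encoder ranks documents once the vectors exist. The three-dimensional vectors are made up for illustration (a real model would produce them, usually with hundreds of dimensions), and `cosine` and `rank` are hypothetical helper names, not part of any library.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query_vec, doc_vecs):
    """Return (doc_name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up embeddings, not output from any real model.
query_vec = [0.9, 0.1, 0.4]  # stands in for "how do I return a damaged item?"
doc_vecs = {
    "return_damaged": [0.8, 0.2, 0.5],
    "return_unopened": [0.85, 0.15, 0.1],
    "cake_recipe": [0.0, 1.0, 0.0],
}

ranking = rank(query_vec, doc_vecs)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In these toy numbers the two return documents score within a few hundredths of each other while the recipe is far away. That narrow gap between same-topic documents is what hard negative training is meant to widen.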

Cross-encoders, used in reranking, jointly encode the query and document and are inherently better at fine-grained relevance judgments. But they're too slow for first-stage retrieval. The practical pipeline is: bi-encoder retrieval (fast, approximate) → cross-encoder reranking (slow, precise). Hard negative training improves the bi-encoder so fewer relevant documents are missed before reranking even runs.
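The two-stage pipeline can be sketched as follows. This is an illustration under loud assumptions: `topic_score` stands in for a bi-encoder that only captures topic, `overlap_score` stands in for a cross-encoder, and both are toy functions invented here rather than real models.

```python
def retrieve_then_rerank(query, docs, first_stage, reranker, k=2, top_n=1):
    """Keep the k best docs by the cheap scorer, then reorder them with the precise one."""
    candidates = sorted(docs, key=lambda d: first_stage(query, d), reverse=True)[:k]
    return sorted(candidates, key=lambda d: reranker(query, d), reverse=True)[:top_n]


def topic_score(query, doc):
    """Toy first stage: only checks whether both texts mention returns."""
    return 1.0 if "return" in query.lower() and "return" in doc.lower() else 0.0


def overlap_score(query, doc):
    """Toy reranker: counts lowercase words shared by query and document."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


docs = [
    "Return policy for unopened items",
    "Return policy for damaged items",
    "Shipping times and costs",
]
query = "how do I return a damaged item"

# With k=1 the first stage passes along only one candidate, so the reranker
# never sees the damaged-items document. With k=2 it can recover it.
narrow = retrieve_then_rerank(query, docs, topic_score, overlap_score, k=1)
wide = retrieve_then_rerank(query, docs, topic_score, overlap_score, k=2)
```

The reranker can only reorder what the first stage hands it. A first stage that cannot tell the two return documents apart has to rely on a larger k to avoid dropping the right one, which is the cost hard negative training reduces.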

Mining strategy 1: In-batch negatives

The simplest strategy: treat every other positive in the batch as a negative for each query. If your batch contains 64 (query, positive) pairs, each query gets 63 in-batch negatives.

This works because a large, well-sampled batch will contain documents that are topically similar to any given query's positive — making them semi-hard to hard negatives naturally. The MultipleNegativesRankingLoss (used in sentence-transformers) implements this directly.

from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# Each InputExample: (query, positive_doc) — negatives are all other positives in the batch
train_examples = [
    InputExample(texts=["how do I return a damaged item?", "Return policy for damaged goods: ..."]),
    InputExample(texts=["return policy for gifts", "Gift return policy: items can be returned within 60 days..."]),
    # ... more (query, positive) pairs
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
)

Limitation: in-batch negatives are only as hard as your batch is large and topically diverse. With small batches or narrow corpora, most in-batch negatives are still easy.
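For intuition about what the loss computes, here is a dependency-free sketch of an in-batch negatives objective: each query is scored against every positive in the batch, and cross-entropy pushes the matching pair (the diagonal) above the rest. It is a simplified illustration, not the library's implementation. The function name `in_batch_loss` is hypothetical, and the `scale=20.0` temperature is an assumption chosen to resemble a typical setting.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def in_batch_loss(query_vecs, doc_vecs, scale=20.0):
    """Mean cross-entropy where doc_vecs[i] is the positive for query_vecs[i]
    and every other doc in the batch acts as a negative."""
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [scale * cosine(q, d) for d in doc_vecs]
        # Numerically stable log-sum-exp over the whole batch
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability assigned to the true positive
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy 2-D embeddings (made up): each query matches its own positive exactly.
aligned = in_batch_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
# Same vectors, but the positives are swapped so each query matches the wrong doc.
swapped = in_batch_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

When the other documents in the batch are far from the query, their logits contribute almost nothing to the sum and the loss is near zero. This is the arithmetic behind the limitation above: easy in-batch negatives produce almost no gradient.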

Mining strategy 2: BM25 top-k negatives

BM25 retrieves documents that share keywords with the query but may not be semantically relevant. These are excellent hard negatives: they look like they match (keyword overlap), but the bi-encoder needs to learn they don't actually answer the query.

from rank_bm25 import BM25Okapi
import numpy as np


def tokenize(text):
    # Lowercase so "Return" and "return" count as the same term
    return text.lower().split()


def mine_bm25_hard_negatives(queries, positives, corpus, top_k=20, num_negatives=5):
    """
    queries: list of query strings
    positives: list of sets of positive doc IDs, one set per query
    corpus: list of (doc_id, doc_text) tuples
    Returns a list with up to num_negatives hard-negative doc IDs per query.
    """
    doc_ids = [doc_id for doc_id, _ in corpus]
    bm25 = BM25Okapi([tokenize(doc_text) for _, doc_text in corpus])

    hard_negatives = []
    for query, pos_ids in zip(queries, positives):
        scores = bm25.get_scores(tokenize(query))
        # Indices of the top_k highest-scoring documents, best first
        top_k_indices = np.argsort(scores)[::-1][:top_k]
        top_k_doc_ids = [doc_ids[i] for i in top_k_indices]
        # Exclude known positives
        negatives = [doc_id for doc_id in top_k_doc_ids if doc_id not in pos_ids]
        # Keep the highest-ranked remaining documents as hard negatives
        hard_negatives.append(negatives[:num_negatives])

    return hard_negatives
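The mined IDs still have to be paired with text before training. Here is one possible sketch, assuming string doc IDs and a `doc_text` dictionary that maps each ID to its text. `build_triplets` is a hypothetical helper, not part of rank_bm25 or sentence-transformers.

```python
def build_triplets(queries, positive_ids, hard_negative_ids, doc_text):
    """
    queries: list of query strings
    positive_ids: list of sets of positive doc IDs, one set per query
    hard_negative_ids: list of lists of mined negative doc IDs, one list per query
    doc_text: dict mapping doc ID to document text
    Returns a list of (query, positive_text, negative_text) tuples.
    """
    triplets = []
    for query, pos_ids, neg_ids in zip(queries, positive_ids, hard_negative_ids):
        # sorted() makes the output order deterministic for set inputs
        for pos_id in sorted(pos_ids):
            for neg_id in neg_ids:
                triplets.append((query, doc_text[pos_id], doc_text[neg_id]))
    return triplets


# Toy data, invented for illustration
doc_text = {
    "d1": "Return policy for damaged goods: ...",
    "d2": "Return policy for unopened items: ...",
    "d3": "Gift return policy: ...",
}
triplets = build_triplets(
    queries=["how do I return a damaged item?"],
    positive_ids=[{"d1"}],
    hard_negative_ids=[["d2", "d3"]],
    doc_text=doc_text,
)
```

Each tuple can then be wrapped as `InputExample(texts=[query, positive, negative])`. MultipleNegativesRankingLoss accepts that three-text form and uses the third text as an extra negative alongside the in-batch ones.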

Mining strategy 3: Teacher LLM labeling

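The highest-quality hard negatives come from using a strong model (a cross-encoder or an LLM) to identify which retrieved documents are near-misses. The workflow: retrieve the top candidates for each query with your current model, have the teacher judge each candidate against the query, and keep the candidates that rank highly but that the teacher judges not relevant as hard negatives. Then retrain on them.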

This loop (mine with the current model, label with a teacher, retrain) is called iterative hard negative mining. Hard negative mining of this kind, in varying forms, is part of the published training recipes for embedding model families such as E5, BGE, and GTE.
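The mine-and-label step can be sketched as below. Everything here is an assumption made for illustration: `retrieve` stands in for your current model's top-k search, `teacher_score` stands in for a cross-encoder or LLM judge returning a relevance score between 0 and 1, and the 0.3 cutoff is an arbitrary example value to tune on your own data.

```python
def mine_with_teacher(query, positives, retrieve, teacher_score,
                      k=20, max_teacher_score=0.3, num_negatives=5):
    """
    query: query string
    positives: set of doc IDs known to be relevant
    retrieve: callable (query, k) -> list of doc IDs, best first (current model)
    teacher_score: callable (query, doc_id) -> relevance in [0, 1] (teacher)
    Returns up to num_negatives doc IDs that the current model ranks highly
    but the teacher judges not relevant.
    """
    negatives = []
    for doc_id in retrieve(query, k):
        if doc_id in positives:
            continue
        # A high teacher score suggests an unlabeled positive, so skip it
        # rather than train on it as a negative.
        if teacher_score(query, doc_id) <= max_teacher_score:
            negatives.append(doc_id)
        if len(negatives) == num_negatives:
            break
    return negatives


# Toy stand-ins for the retriever and the teacher (invented values)
toy_results = ["d1", "d2", "d3", "d4"]
toy_scores = {"d1": 0.95, "d2": 0.9, "d3": 0.1, "d4": 0.2}

mined = mine_with_teacher(
    query="how do I return a damaged item?",
    positives={"d1"},
    retrieve=lambda q, k: toy_results[:k],
    teacher_score=lambda q, d: toy_scores[d],
)
```

In the toy run, `d2` is dropped even though it is not a labeled positive, because the teacher scores it as relevant. Filtering out such likely false negatives is the main thing a teacher adds over BM25 mining alone.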

Practical tips and common mistakes

The single highest-ROI change you can make to a retrieval system: switch from random negatives to BM25-mined hard negatives. In domain-specific fine-tuning experiments, this alone can improve NDCG@10 by 8–15 points without changing model architecture, training duration, or dataset size.
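To check a claim like that on your own data, you need NDCG@10 before and after the change. Here is a minimal sketch, with one labeled simplification: the ideal ordering is computed from the relevance labels of the retrieved list only, which assumes every relevant document appears somewhere in that list. The function names are hypothetical, and a "point" is taken here to mean 0.01 on the 0 to 1 scale, which is an assumption about the text above.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k results (rank 1 has discount log2(2) = 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """
    relevances: relevance labels of the retrieved docs, in ranked order
    Simplification: the ideal ranking is built from these same labels.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Toy rankings for one query: 1 = relevant, 0 = not relevant
before = ndcg_at_k([0, 1, 0, 0])  # relevant doc ranked second
after = ndcg_at_k([1, 0, 0, 0])   # relevant doc ranked first
```

Average the per-query values over a held-out query set, and compare the averages for the model trained on random negatives against the one trained on mined hard negatives.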
