
Hard Negatives: The Training Trick That Actually Improves Retrieval

Hard negatives — semantically close but intent-misaligned pairs — are the highest-ROI training signal for retrieval models. Three mining strategies with code.

If you've trained or fine-tuned a retrieval model and it still underperforms on your dataset, your negatives are a likely culprit. Specifically: your negatives are too easy. A model that has never had to distinguish between documents that look similar but mean different things has no training signal from which to learn that distinction. Hard negatives supply that signal.

This post covers what hard negatives are, why they're the highest-leverage improvement you can make to retrieval training, and three concrete mining strategies you can implement today.

Easy negatives vs. hard negatives

A training example for a retrieval model consists of a query, a positive document (relevant), and one or more negative documents (not relevant). The quality of your negatives determines how much the model learns from each example.

Negative type	Example	What the model learns
Random/easy	Query: 'return policy' → Negative: a recipe for chocolate cake	Almost nothing — the difference is obvious without learning
Semi-hard	Query: 'return policy' → Negative: a shipping policy document	Some discriminative features — different topic, similar domain
Hard negative	Query: 'how do I return a damaged item?' → Negative: 'return policy for unopened items'	Fine-grained intent distinction — same topic, different user need

Hard negatives are semantically close to the query but intent-misaligned. They share surface-level vocabulary and topic with the positive, but don't actually answer the user's question. Training on these forces the model to encode intent, not just topic similarity.

Why hard negatives matter for bi-encoders

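Bi-encoders (the architecture behind the models in the sentence-transformers library and, as far as is publicly documented, embedding APIs such as OpenAI's text-embedding-3 and Cohere's embed models) encode queries and documents independently into dense vectors. Document vectors are computed ahead of time, so at query time only the query is encoded and similarity is a dot product or cosine similarity against the stored vectors.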

The fundamental problem: a bi-encoder trained only on easy negatives learns to separate topics, not intents. It will correctly retrieve documents about 'return policies' for a query about returns — but it won't correctly rank 'return damaged items' above 'return unopened items' for a user who needs the former. The vectors for these two documents are nearly identical in topic space but diverge in intent space. Hard negative training is what teaches the model to encode that intent distinction.

Cross-encoders, used in reranking, jointly encode the query and document and are inherently better at fine-grained relevance judgments. But they're too slow for first-stage retrieval. The practical pipeline is: bi-encoder retrieval (fast, approximate) → cross-encoder reranking (slow, precise). Hard negative training improves the bi-encoder so fewer relevant documents are missed before reranking even runs.
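
The two-stage pipeline can be sketched in a few lines. This is a minimal illustration, not a production design: the vectors are made-up toy numbers, and `rerank_fn` is a hypothetical stand-in for a cross-encoder that, for brevity, maps a document index to a relevance score instead of scoring (query, document) text.

```python
import numpy as np

def retrieve_then_rerank(query_vec, doc_vecs, rerank_fn, k_retrieve=100, k_final=10):
    """Stage 1: dot-product top-k over stored vectors. Stage 2: rerank the survivors."""
    # Stage 1 (bi-encoder): one matrix-vector product scores every document.
    scores = doc_vecs @ query_vec
    k = min(k_retrieve, len(scores))
    candidates = np.argsort(scores)[::-1][:k]
    # Stage 2 (cross-encoder stand-in): only the k candidates are rescored.
    reranked = sorted(candidates.tolist(), key=rerank_fn, reverse=True)
    return reranked[:k_final]

# Toy corpus: 4 documents in a 3-d embedding space (illustrative numbers only).
doc_vecs = np.array([
    [0.9, 0.1, 0.0],  # doc 0
    [0.8, 0.2, 0.1],  # doc 1
    [0.0, 0.1, 0.9],  # doc 2
    [0.7, 0.3, 0.0],  # doc 3
])
query_vec = np.array([1.0, 0.0, 0.0])

# Hypothetical reranker scores: doc 2 is what the precise model likes best.
rerank_scores = {0: 0.2, 1: 0.9, 2: 0.95, 3: 0.5}

top = retrieve_then_rerank(query_vec, doc_vecs, rerank_scores.get, k_retrieve=3, k_final=2)
print(top)
```

In this toy run, doc 2 has the highest reranker score but never reaches the reranker, because the first stage ranked it last. That is the failure hard negative training is meant to reduce.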

Mining strategy 1: In-batch negatives

The simplest strategy: treat every other positive in the batch as a negative for each query. If your batch contains 64 (query, positive) pairs, each query gets 63 in-batch negatives.

This works because a large, well-sampled batch will contain documents that are topically similar to any given query's positive — making them semi-hard to hard negatives naturally. The MultipleNegativesRankingLoss (used in sentence-transformers) implements this directly.

from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# Each InputExample: (query, positive_doc) — negatives are all other positives in the batch
train_examples = [
    InputExample(texts=["how do I return a damaged item?", "Return policy for damaged goods: ..."]),
    InputExample(texts=["return policy for gifts", "Gift return policy: items can be returned within 60 days..."]),
    # ... more (query, positive) pairs
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
)

Limitation: in-batch negatives are only as hard as your batch is large and topically diverse. With small batches or narrow corpora, most in-batch negatives are still easy.
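
To see why easy batches teach so little, it helps to write the in-batch objective out by hand. The NumPy sketch below is an approximation of what a loss like MultipleNegativesRankingLoss computes: cross-entropy over a scaled cosine-similarity matrix, where each query's own positive sits on the diagonal. The function name and the `scale=20.0` value are assumptions made for this illustration, not code taken from the library.

```python
import numpy as np

def in_batch_negatives_loss(query_vecs, pos_vecs, scale=20.0):
    """Mean cross-entropy where row i's correct 'class' is column i."""
    # L2-normalise so the dot product equals cosine similarity.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    p = pos_vecs / np.linalg.norm(pos_vecs, axis=1, keepdims=True)
    logits = scale * (q @ p.T)  # shape (batch, batch)
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Diagonal = each query's own positive; off-diagonal = in-batch negatives.
    return float(-np.mean(np.diag(log_probs)))

queries = np.eye(3)

# Easy batch: every positive points in a different direction from the others.
easy_loss = in_batch_negatives_loss(queries, np.eye(3))

# Hard batch: all positives look identical, so the model cannot tell them apart.
hard_loss = in_batch_negatives_loss(queries, np.ones((3, 3)))

print(easy_loss)  # close to 0: nothing left to learn from this batch
print(hard_loss)  # ln(3), about 1.0986: maximal confusion across 3 candidates
```

A loss near zero means gradients near zero, which is the limitation described above in numeric form: when the other positives in the batch are trivially different, the batch contributes almost no learning signal.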

Mining strategy 2: BM25 top-k negatives

BM25 retrieves documents that share keywords with the query but may not be semantically relevant. These are excellent hard negatives: they look like they match (keyword overlap), but the bi-encoder needs to learn they don't actually answer the query.

1. For each training query, run BM25 over your corpus.
2. Retrieve the top-k documents (e.g., top 20).
3. Remove any documents that are known positives.
4. Use the remaining BM25 hits as hard negatives in training.

from rank_bm25 import BM25Okapi
import numpy as np

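def mine_bm25_hard_negatives(queries, positives, corpus, top_k=20, per_query=5):
    """
    queries: list of query strings
    positives: list of sets of positive doc IDs per query
    corpus: list of (doc_id, doc_text) tuples
    Returns one list of up to per_query hard-negative doc IDs per query.
    """
    # Lowercase before splitting so "Return" and "return" count as the same term.
    # Queries must be tokenized the same way as the corpus.
    tokenized_corpus = [doc_text.lower().split() for _, doc_text in corpus]
    bm25 = BM25Okapi(tokenized_corpus)
    doc_ids = [doc_id for doc_id, _ in corpus]

    hard_negatives = []
    for query, pos_ids in zip(queries, positives):
        scores = bm25.get_scores(query.lower().split())
        # Indices of the top_k highest-scoring documents, best first
        top_k_indices = np.argsort(scores)[::-1][:top_k]
        top_k_doc_ids = [doc_ids[i] for i in top_k_indices]
        # Exclude known positives. Unlabeled positives can still slip through,
        # so spot-check the mined negatives before training on them.
        negatives = [doc_id for doc_id in top_k_doc_ids if doc_id not in pos_ids]
        hard_negatives.append(negatives[:per_query])  # Keep the highest-ranked hits

    return hard_negatives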
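
Mined negative IDs still have to be turned into training examples. One plausible way, sketched here under the assumption that you train on (query, positive, negative) triplets, is to expand each query into one triplet per mined negative. `build_triplets` is a hypothetical helper written for this post, and the toy corpus is illustrative.

```python
def build_triplets(queries, positive_texts, hard_negative_ids, corpus):
    """
    queries: list of query strings
    positive_texts: one positive document text per query
    hard_negative_ids: one list of doc IDs per query (e.g. BM25-mined negatives)
    corpus: list of (doc_id, doc_text) tuples
    Returns a flat list of (query, positive_text, negative_text) tuples.
    """
    id_to_text = dict(corpus)
    triplets = []
    for query, pos_text, neg_ids in zip(queries, positive_texts, hard_negative_ids):
        for neg_id in neg_ids:
            triplets.append((query, pos_text, id_to_text[neg_id]))
    return triplets

corpus = [
    ("d1", "Return policy for damaged goods"),
    ("d2", "Return policy for unopened items"),
    ("d3", "Shipping policy"),
]
triplets = build_triplets(
    queries=["how do I return a damaged item?"],
    positive_texts=["Return policy for damaged goods"],
    hard_negative_ids=[["d2", "d3"]],
    corpus=corpus,
)
for triplet in triplets:
    print(triplet)
```

With sentence-transformers, each triplet could then be wrapped as `InputExample(texts=[query, positive, negative])`. MultipleNegativesRankingLoss accepts such triplets and treats the third text as an additional negative alongside the in-batch ones.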

Mining strategy 3: Teacher LLM labeling

The highest-quality hard negatives come from using a strong model (a cross-encoder or an LLM) to identify which retrieved documents are near-misses. The workflow:

1. Retrieve the top 100 documents per query using your current bi-encoder.
2. Score each (query, document) pair with a cross-encoder or LLM judge, with scores normalized to the 0–1 range.
3. Treat documents scored 0.3–0.7 (uncertain relevance) as your best hard-negative candidates: close enough to be confusing, but judged not clearly relevant.
4. Treat documents scored above 0.8 as additional positives and add them to your positive set.
5. Retrain with the 0.3–0.7 documents as hard negatives, then repeat from step 1 with the updated model.
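
The thresholding in the steps above can be sketched as a small pure function. It assumes the teacher's scores are already normalized to 0–1 (many cross-encoders emit raw logits, so a sigmoid may be needed first). `bucket_by_teacher_score` and its parameter names are hypothetical, and the thresholds are the ones quoted in this post.

```python
def bucket_by_teacher_score(scored_docs, known_positive_ids,
                            neg_low=0.3, neg_high=0.7, pos_threshold=0.8):
    """
    scored_docs: list of (doc_id, score) pairs, score assumed to be in [0, 1]
    known_positive_ids: set of doc IDs already labeled relevant for this query
    Returns (new_positives, hard_negatives) as lists of doc IDs.
    """
    new_positives, hard_negatives = [], []
    for doc_id, score in scored_docs:
        if doc_id in known_positive_ids:
            continue  # never let a labeled positive become a negative
        if score > pos_threshold:
            new_positives.append(doc_id)
        elif neg_low <= score <= neg_high:
            hard_negatives.append(doc_id)
        # Anything else is dropped: too easy (< 0.3) or too ambiguous (0.7–0.8).
    return new_positives, hard_negatives

scored = [("a", 0.95), ("b", 0.5), ("c", 0.1), ("d", 0.75), ("e", 0.65)]
new_pos, hard_neg = bucket_by_teacher_score(scored, known_positive_ids={"e"})
print(new_pos)   # ['a']
print(hard_neg)  # ['b']
```

If the teacher is a sentence-transformers `CrossEncoder`, the scores could come from its `predict` method called on a list of (query, document) text pairs. Dropping the 0.7–0.8 band is a design choice of this sketch: the steps above assign that band to neither bucket.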

This loop (mine with the current model, label with a teacher, retrain) is known as iterative hard negative mining. Mined hard negatives of this kind are part of the published training recipes for embedding model families such as E5, BGE, and GTE.

Practical tips and common mistakes

- Don't use too many hard negatives per query early in training. Start with 1–2 and increase as the model improves. Too many too early can destabilize training: the loss spikes and the model may collapse.
- Always verify your positives before mining negatives. With a noisy positive set, some of your 'hard negatives' are actually unlabeled positives, and training against them pushes relevant documents away from their queries.
- Mix easy and hard negatives in each batch. A ratio of roughly 70% in-batch negatives to 30% mined hard negatives tends to work well. Pure hard negative training can make the model overfit to the specific patterns in your mined set.
- Evaluate on a held-out set with hard negatives too. Evals that rank each positive against randomly sampled negatives make models look better than they are. Build an eval set that includes BM25-retrieved false positives to measure real retrieval quality.
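
The mixing tip is easiest to apply when you build explicit negative lists per query, instead of relying on the loss to supply in-batch negatives. A minimal sketch under that assumption follows. `mix_negatives`, the pool names, and the default of 10 negatives per query are all hypothetical; the 30% default is the rough ratio suggested above.

```python
import random

def mix_negatives(easy_pool, mined_hard, n_total=10, hard_fraction=0.3, seed=0):
    """Return up to n_total negatives, about hard_fraction of them from mined_hard."""
    rng = random.Random(seed)  # seeded so the sampled mix is reproducible
    # Cap each share by what is actually available in its pool.
    n_hard = min(round(n_total * hard_fraction), len(mined_hard))
    n_easy = min(n_total - n_hard, len(easy_pool))
    mixed = rng.sample(mined_hard, n_hard) + rng.sample(easy_pool, n_easy)
    rng.shuffle(mixed)  # avoid a fixed hard-then-easy ordering
    return mixed

easy_pool = [f"easy_{i}" for i in range(20)]   # e.g. other queries' positives
mined_hard = [f"hard_{i}" for i in range(5)]   # e.g. BM25 or teacher-mined docs

mixed = mix_negatives(easy_pool, mined_hard)
n_hard_in_mix = sum(doc.startswith("hard_") for doc in mixed)
print(len(mixed), n_hard_in_mix)
```

Raising `hard_fraction` gradually over training rounds is one way to combine this with the first tip about starting with few hard negatives.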

The single highest-ROI change you can make to a retrieval system is usually this: switch from random negatives to BM25-mined hard negatives. In domain-specific fine-tuning, this alone can improve NDCG@10 by roughly 8–15 points without changing model architecture, training duration, or dataset size, though the exact gain depends on your corpus and baseline.
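
For reference, NDCG@10 is simple enough to compute by hand on your own held-out set. The sketch below assumes binary relevance labels and the standard log2 position discount. `ndcg_at_k` is a hypothetical helper, and the document IDs are made up.

```python
import math

def ndcg_at_k(ranked_doc_ids, relevant_ids, k=10):
    """Binary-relevance NDCG@k for a single query, on a 0-1 scale."""
    # rank is 0-based, so the top position gets a discount of log2(2) = 1.
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_doc_ids[:k])
        if doc_id in relevant_ids
    )
    # Ideal ordering: every relevant document packed at the top of the list.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

relevant = {"return_damaged"}

# The model ranks the true answer first.
good = ndcg_at_k(["return_damaged", "return_unopened", "shipping"], relevant)

# A hard negative outranks the true answer by one position.
confused = ndcg_at_k(["return_unopened", "return_damaged", "shipping"], relevant)

print(good)      # 1.0
print(confused)  # 1 / log2(3), about 0.631
```

If "points" is read as hundredths of this 0-1 score, a single hard negative outranking the only relevant document costs about 37 points for that query, which is why an eval set full of near-misses is far more sensitive than one padded with random documents.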
