
Hallucination Detection: Why It's Hard and What Actually Works

Factual vs. faithfulness vs. citation hallucinations. NLI-based detection, self-consistency, and retrieval grounding — tested against real examples.

Hallucination is the most cited failure mode of LLMs — and also the most misunderstood. Not all hallucinations are the same. Detecting them requires different techniques depending on what type you're dealing with.

Three types of hallucination

Type	Definition	Example	Detection method
Factual	Model asserts a false real-world fact	"Einstein won the 1922 Nobel Prize in Physics" (he won the 1921 prize, which was announced in 1922)	External knowledge base lookup
Faithfulness	Answer contradicts the provided context	Context says revenue was $4M, answer says $14M	NLI / entailment model
Citation	Model cites a source that doesn't support the claim (or doesn't exist)	Fabricated paper title/DOI	Source verification

NLI-based faithfulness detection

Natural Language Inference (NLI) models classify whether a hypothesis is entailed by, contradicted by, or neutral to a premise. For RAG, you can use an NLI model to check whether each claim in the model's answer is entailed by the retrieved context.

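from transformers import pipeline

# Cross-encoder NLI model; its labels are contradiction / entailment / neutral.
nli = pipeline("text-classification",
               model="cross-encoder/nli-deberta-v3-small")

context = "The company was founded in 2018 and went public in 2023."
claim   = "The company has been public since 2021."

# Pass premise and hypothesis as a pair so the tokenizer adds the separator
# itself, rather than relying on a literal "[SEP]" inside one string.
result = nli({"text": context, "text_pair": claim})
pred = result[0] if isinstance(result, list) else result  # return shape varies by version

# Expected label for this pair: contradiction (casing and score depend on the model)
if pred["label"].lower() == "contradiction":
    print(f"Potential hallucination: {claim!r} (score={pred['score']:.2f})")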
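
To cover a whole answer rather than a single claim, the same check can be looped over each sentence. The following is a minimal sketch under stated assumptions, not a definitive implementation: `split_claims`, `flag_unsupported_claims`, and the `nli_fn` callable are hypothetical names introduced here, sentence splitting is a crude stand-in for real claim extraction, and `nli_fn` is assumed to return a label such as "entailment", "neutral", or "contradiction".

```python
import re
from typing import Callable, List, Tuple

# Hypothetical signature: (premise, hypothesis) -> label string.
NliFn = Callable[[str, str], str]


def split_claims(answer: str) -> List[str]:
    """Naive claim extraction: split after sentence-ending punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [p for p in parts if p]


def flag_unsupported_claims(
    context: str, answer: str, nli_fn: NliFn
) -> List[Tuple[str, str]]:
    """Return (claim, label) for every claim the context does not entail."""
    flagged = []
    for claim in split_claims(answer):
        label = nli_fn(context, claim).lower()
        # Assumption: anything other than entailment counts as unsupported,
        # so "neutral" claims are flagged alongside contradictions.
        if label != "entailment":
            flagged.append((claim, label))
    return flagged
```

In practice `nli_fn` would wrap an NLI model call like the one shown earlier. Whether "neutral" claims should be flagged is a design choice: flagging them is stricter and catches unsupported additions, at the cost of more false positives.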

Self-consistency as a hallucination signal

Sample several answers to the same prompt with temperature > 0. If the answers agree, the model is more likely to be correct. High variance across samples signals low confidence, which makes it a useful proxy for potential hallucination when no ground truth is available.

Self-consistency works because hallucinations are often low-probability outputs: a hallucinated fact tends to change from sample to sample, while a true fact tends to be stated the same way each time. The signal is not proof, though. A model can repeat the same wrong answer consistently, so agreement lowers the risk without eliminating it.
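
A rough way to turn this into a number is to sample several answers and measure how often the most common one appears. This is a sketch only: `sample_fn` is a hypothetical callable standing in for whatever model client you use (called with temperature > 0), and exact-match comparison after normalization is an assumption that only suits short factual answers.

```python
from collections import Counter
from typing import Callable, List, Tuple


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial variations still match."""
    return " ".join(answer.lower().split())


def self_consistency(
    prompt: str, sample_fn: Callable[[str], str], n: int = 5
) -> Tuple[str, float]:
    """Sample n answers; return (most common answer, agreement in [0, 1])."""
    if n < 1:
        raise ValueError("n must be at least 1")
    answers: List[str] = [normalize(sample_fn(prompt)) for _ in range(n)]
    top_answer, top_count = Counter(answers).most_common(1)[0]
    return top_answer, top_count / n
```

An agreement of 1.0 means every sample matched, while a value near 1/n means nearly every sample differed. Longer free-form answers would need a semantic comparison (for example an NLI or embedding check between samples) instead of string equality.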

RAGAS faithfulness metric

RAGAS decomposes the model's answer into atomic claims and checks each claim against the retrieved context using an LLM-as-judge pattern, producing a faithfulness score between 0 and 1. It is one of the most widely used RAG-specific hallucination metrics, and it is usually read alongside three other RAGAS metrics:

Faithfulness: fraction of answer claims that are entailed by the retrieved context
Answer Relevancy: how well the answer addresses the actual question
Context Precision: fraction of retrieved context that's actually relevant
Context Recall: fraction of ground-truth information that's present in the retrieved context
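
The faithfulness definition above reduces to simple arithmetic once each claim has a verdict. The sketch below illustrates only that arithmetic, not RAGAS's actual implementation: `faithfulness_score` is a hypothetical helper, the verdicts are assumed to come from an LLM judge, and raising on an empty claim list is an assumption about how to treat that edge case.

```python
from typing import List


def faithfulness_score(verdicts: List[bool]) -> float:
    """Fraction of answer claims judged as supported by the context."""
    if not verdicts:
        raise ValueError("no claims to score")
    return sum(verdicts) / len(verdicts)


# Four claims extracted from an answer; the judge supports three of them.
print(faithfulness_score([True, True, False, True]))  # 0.75
```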

Run RAGAS offline on a golden test set (100–200 hand-labelled Q&A pairs) every time you change your RAG pipeline. It's the fastest way to catch regressions before they reach users.
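
One way to wire this into a pipeline change is a small gate that compares the mean score on the golden set against a stored baseline. This is a sketch under assumptions: `check_regression` and `score_fn` are hypothetical names (`score_fn` stands in for whatever computes a per-example faithfulness score), and the 0.02 tolerance is an arbitrary illustrative value.

```python
from statistics import mean
from typing import Callable, Dict, List


def check_regression(
    golden_set: List[Dict[str, str]],
    score_fn: Callable[[Dict[str, str]], float],
    baseline: float,
    tolerance: float = 0.02,
) -> Dict[str, object]:
    """Score every golden example and compare the mean to a stored baseline."""
    if not golden_set:
        raise ValueError("golden set is empty")
    scores = [score_fn(example) for example in golden_set]
    current = mean(scores)
    return {
        "mean_score": current,
        "baseline": baseline,
        # Regressed only if the drop exceeds the tolerance, to absorb judge noise.
        "regressed": current < baseline - tolerance,
    }
```

The tolerance exists because LLM-judged scores vary slightly from run to run. Looking at the examples whose individual scores dropped is usually more informative than the mean alone.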
