RAGAS Metrics Explained: What They Measure, What They Miss, and When They Lie

Faithfulness, answer relevancy, context precision, and context recall each measure a different failure mode — but each has blind spots. When 0.91 faithfulness means nothing, and how to build a composite eval that catches what RAGAS misses.

What RAGAS actually measures

RAGAS is one of the most widely used RAG evaluation frameworks. Most teams treat its output as a quality score: 0.91 faithfulness sounds like a 91% pass rate. It is not. Each RAGAS metric measures a specific, narrow property of a specific component of the RAG pipeline, and each one has documented failure modes where it returns high scores on bad outputs. Understanding what the numbers mean is a prerequisite to using them correctly.

There are four core RAGAS metrics. They are not interchangeable, they are not equally important for all use cases, and the right interpretation depends entirely on what your system is trying to do.

Faithfulness

Faithfulness measures whether the claims in the generated answer can be inferred from the retrieved context. It is defined as: number of claims in the answer that are grounded in the context, divided by the total number of claims in the answer.
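The ratio can be made concrete with a minimal Python sketch. It assumes the claims have already been extracted from the answer and judged against the context (in practice an LLM judge typically does both steps). `ClaimVerdict` and `faithfulness_score` are hypothetical names used for illustration, not part of the RAGAS API.

```python
from dataclasses import dataclass


@dataclass
class ClaimVerdict:
    claim: str
    supported: bool  # True if the claim can be inferred from the retrieved context


def faithfulness_score(verdicts: list[ClaimVerdict]) -> float:
    """Grounded claims divided by total claims in the answer."""
    if not verdicts:
        raise ValueError("answer contains no claims to score")
    grounded = sum(1 for v in verdicts if v.supported)
    return grounded / len(verdicts)


# Hypothetical verdicts for a three-claim answer: two grounded, one not.
verdicts = [
    ClaimVerdict("The Eiffel Tower is in Paris.", True),
    ClaimVerdict("It was completed in 1889.", True),
    ClaimVerdict("It receives 50 million visitors a year.", False),
]
score = faithfulness_score(verdicts)
print(round(score, 2))  # 0.67
```

Note that nothing in this calculation looks at whether the context is true. The score depends only on the answer-to-context verdicts.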

What faithfulness catches: claims that contradict the retrieved documents or are not supported by them. A response that adds information not present in the context, such as a fabricated statistic, will have low faithfulness.

What faithfulness misses: whether the answer is actually correct. If the context itself contains incorrect information (stale documents, scraped errors, contradictions), a response that faithfully reproduces those errors will score 1.0 faithfulness. The result is a faithful hallucination.

High faithfulness with low accuracy means the problem is in your retrieved documents, not your model. Faithfulness only measures consistency between answer and context; it does not tell you whether either one reflects reality. Pair faithfulness with document quality audits.

Answer relevancy

Answer relevancy measures whether the generated answer addresses the user's question. It is computed using an indirect method: generate multiple candidate questions from the answer, then measure cosine similarity between those generated questions and the original question. A relevant answer should produce generated questions that are similar to what was asked.
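The indirect method can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the RAGAS implementation: the question-generation step is assumed to have already happened, and the three-dimensional vectors are toy stand-ins for real embedding vectors. `cosine_similarity` and `answer_relevancy` are hypothetical helper names.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def answer_relevancy(question_vec: list[float],
                     generated_question_vecs: list[list[float]]) -> float:
    """Mean similarity between the original question and the questions
    generated back from the answer."""
    if not generated_question_vecs:
        raise ValueError("need at least one generated question")
    sims = [cosine_similarity(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims)


# Toy embeddings (assumed): the original question, then questions generated
# from an on-topic answer and from an off-topic answer.
original = [1.0, 0.0, 0.0]
from_on_topic_answer = [[0.9, 0.1, 0.0], [1.0, 0.0, 0.1]]
from_off_topic_answer = [[0.0, 1.0, 0.0], [0.1, 0.9, 0.2]]

on_score = answer_relevancy(original, from_on_topic_answer)
off_score = answer_relevancy(original, from_off_topic_answer)
print(on_score > off_score)  # True
```

Because the score is built entirely from similarity between questions, it never checks the answer against any source of truth.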

What answer relevancy catches: evasive answers, off-topic responses, answers that technically contain true statements but don't address the question. A response to 'What is the capital of France?' that answers 'France is a country in Western Europe' would score low relevancy.

What answer relevancy misses: factual accuracy. An answer that is directly relevant and completely wrong will score high relevancy. The metric measures whether the answer is about the right topic, not whether it is true.

Answer relevancy and faithfulness are complementary. High relevancy + high faithfulness = answer addresses the question and is grounded in context. Neither metric catches factual errors in the context itself. Use both together, but add a factual accuracy check for high-stakes domains.

Context precision

Context precision measures what fraction of the retrieved chunks were actually relevant to answering the question. For example, if the retriever returned 5 chunks and 3 were relevant to the answer, context precision is 3/5 = 0.6.

What context precision catches: retrieval noise — irrelevant chunks that dilute the signal in the context window. Low context precision means your retriever is returning things that don't help, which wastes context budget and increases the chance of the model being distracted by irrelevant content.

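What context precision misses, when it is computed as a plain fraction: whether the relevant chunks were actually at the top of the ranked list. A retriever that returns the only relevant chunk fourth out of five scores the same as one that returns it first, even though the one that ranks it first is the significantly better retriever. Rank-weighted formulations, which average precision@k over the positions of the relevant chunks, do reward the better ordering, so check which definition your RAGAS version implements.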
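The difference between counting relevant chunks and rewarding their rank can be seen in a small Python sketch. It assumes each retrieved chunk already carries a relevant/irrelevant label, listed in rank order. Both function names are hypothetical, and the rank-weighted version is the standard average-precision calculation, offered here as an illustration.

```python
def simple_precision(relevance: list[bool]) -> float:
    """Fraction of retrieved chunks that are relevant; ignores rank."""
    if not relevance:
        raise ValueError("no chunks retrieved")
    return sum(relevance) / len(relevance)


def rank_weighted_precision(relevance: list[bool]) -> float:
    """Average of precision@k taken at each position k holding a relevant chunk."""
    hits = 0
    total = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / k  # precision@k at this relevant position
    return total / hits if hits else 0.0


# One relevant chunk out of five, ranked first versus ranked fourth.
ranked_first = [True, False, False, False, False]
ranked_fourth = [False, False, False, True, False]

print(simple_precision(ranked_first), simple_precision(ranked_fourth))  # 0.2 0.2
print(rank_weighted_precision(ranked_first), rank_weighted_precision(ranked_fourth))  # 1.0 0.25
```

The plain fraction cannot tell the two retrievers apart, while the rank-weighted score separates them clearly.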

Context recall

Context recall measures whether all the information needed to answer the question was present in the retrieved context. It requires a ground-truth answer: given what the ideal answer should contain, what fraction of those required facts appeared in the retrieved chunks?

What context recall catches: retrieval gaps — the retriever found some relevant content but missed a necessary piece. If the ideal answer requires three facts and only two were retrieved, context recall is 0.67.
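The three-facts example works out as follows in a minimal Python sketch. Case-insensitive substring matching is a deliberately crude stand-in, assumed here for illustration, for the judgment of whether a ground-truth fact is attributable to the retrieved context. `context_recall`, the facts, and the chunks are all hypothetical.

```python
def context_recall(required_facts: list[str], retrieved_chunks: list[str]) -> float:
    """Fraction of ground-truth facts found in at least one retrieved chunk."""
    if not required_facts:
        raise ValueError("ground truth contains no facts to look for")
    chunks = [chunk.lower() for chunk in retrieved_chunks]
    found = sum(
        1 for fact in required_facts
        if any(fact.lower() in chunk for chunk in chunks)
    )
    return found / len(required_facts)


# The ideal answer needs three facts; the retriever surfaced only two of them.
required_facts = [
    "founded in 2015",
    "headquartered in Berlin",
    "employs 400 people",
]
retrieved_chunks = [
    "The company was founded in 2015 by two engineers.",
    "It is headquartered in Berlin.",
]
recall = context_recall(required_facts, retrieved_chunks)
print(round(recall, 2))  # 0.67
```

The `required_facts` argument is the labelled ground truth: without it there is nothing to divide by.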

What context recall misses: anything outside a labelled dataset. Because it needs a ground-truth answer for every query, it cannot be computed on unlabelled production traffic. Teams running RAGAS on production queries without labels often skip context recall entirely, and then do not know whether their retriever is missing critical information.

Building a composite eval that actually works

Read the metrics in combination, because each pattern points to a different fix:
High faithfulness + low context precision: the retriever is noisy but the model is ignoring the noise. Fix the retriever anyway, since noise costs tokens and latency.
Low faithfulness + high context recall: the right documents were retrieved but the model is not using them. Check if context is being truncated, or if the model is over-weighting its parametric memory.
High faithfulness + low answer relevancy: the model is accurately reproducing content from context but not actually answering the question. Often a reranking problem — the wrong chunks are being used.
All metrics high, users unhappy: the metrics are measuring the right things but your test set does not represent real queries. Run a monthly sample of production queries through manual evaluation.
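
These patterns can be encoded as a simple triage function. The sketch below is one possible shape, not a prescribed implementation: the `RagScores` container, the `diagnose` function, and the 0.8 / 0.5 thresholds are all assumptions chosen for illustration, and the thresholds should be tuned against your own data.

```python
from dataclasses import dataclass


@dataclass
class RagScores:
    faithfulness: float
    answer_relevancy: float
    context_precision: float
    context_recall: float


def diagnose(s: RagScores, high: float = 0.8, low: float = 0.5) -> list[str]:
    """Map metric combinations to the likely problem area (thresholds are assumed)."""
    findings = []
    if s.faithfulness >= high and s.context_precision <= low:
        findings.append("Noisy retriever: the model ignores the noise, but it costs tokens and latency.")
    if s.faithfulness <= low and s.context_recall >= high:
        findings.append("Unused context: check for truncation or over-reliance on parametric memory.")
    if s.faithfulness >= high and s.answer_relevancy <= low:
        findings.append("Grounded but off-target: check reranking and chunk selection.")
    all_high = min(s.faithfulness, s.answer_relevancy,
                   s.context_precision, s.context_recall) >= high
    if all_high:
        findings.append("All metrics high: if users are unhappy, audit the test set against production queries.")
    return findings


noisy = diagnose(RagScores(0.95, 0.90, 0.30, 0.90))
healthy = diagnose(RagScores(0.92, 0.90, 0.88, 0.85))
for line in noisy + healthy:
    print(line)
```

The last branch cannot detect unhappy users by itself. It only flags that the scores give no further signal, which is the cue to sample production queries for manual review.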
