How to Evaluate LLM Systems: RAGAS, G-Eval, and Custom Grading

What groundedness, faithfulness, and citation accuracy actually measure — and how to build an eval pipeline that catches real failures before users do.

**Readable at any point in the path.** After this post you'll know how to measure whether your AI system is actually working: human evals, LLM-as-judge, automated metrics — and the conditions under which each one lies to you.

You can't eyeball your way to production-ready LLM systems. Human review doesn't scale. You need automated evaluation pipelines that catch regressions before users do. This is the hardest part of LLMOps — and the most skipped.

Why LLM eval is hard

No single correct answer: unlike classification, there are many valid responses to most prompts
Ground truth is expensive: labelling 1,000 examples costs real time and money
Metrics don't align with quality: BLEU and ROUGE are cheap but don't capture what users care about
Distributional shift: your eval set goes stale as user behaviour evolves
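
To make the point about overlap metrics concrete, here is a minimal sketch of a ROUGE-1-style unigram F1. It is a simplified stand-in written for this post, not the official ROUGE implementation, and the example sentences are invented. A candidate that contradicts the reference scores near the top, while a faithful paraphrase scores zero.

```python
from collections import Counter


def unigram_f1(reference: str, candidate: str) -> float:
    """ROUGE-1-style F1 over lowercase whitespace tokens (simplified sketch)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Invented example sentences, for illustration only.
reference = "the refund is issued within 5 days"
wrong = "the refund is not issued within 5 days"        # contradicts the reference
paraphrase = "you get your money back in under a week"  # same meaning, new words

wrong_score = unigram_f1(reference, wrong)
paraphrase_score = unigram_f1(reference, paraphrase)

print(f"contradiction: {wrong_score:.2f}, paraphrase: {paraphrase_score:.2f}")
```

The metric rewards shared tokens, so a single inserted "not" barely moves the score. That is the gap an LLM judge or a human reviewer has to cover.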

LLM-as-judge: the practical standard

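Use a strong model (GPT-4o, Claude Opus) to grade the outputs of your weaker production model. The judge scores each output on dimensions such as correctness, faithfulness, relevance, and tone. This scales far beyond what human reviewers can cover, and judge scores often correlate well with human judgement (0.8+ on many benchmarks). That agreement depends on the task and the judge prompt, so confirm it against a small human-labelled sample of your own data before you rely on it.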

# Fill in with JUDGE_PROMPT.format(question=..., context=..., response=...).
# The braces around the JSON example are doubled so str.format() keeps them as literal braces.
JUDGE_PROMPT = """
You are grading an AI assistant's response.

Question: {question}
Retrieved context: {context}
Assistant response: {response}

Grade on:
1. Faithfulness (0-1): Is every claim in the response supported by the context?
2. Relevance (0-1): Does the response actually answer the question?
3. Completeness (0-1): Does the response cover all key points from the context?

Respond in JSON: {{"faithfulness": 0.X, "relevance": 0.X, "completeness": 0.X, "reason": "..."}}
"""

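A judge's reply is only useful if it parses, and judges sometimes wrap the JSON in prose or return out-of-range scores. The sketch below shows one way to validate a reply against the three scores requested in the prompt above. `parse_judge_reply`, `grade`, and `fake_judge` are hypothetical names introduced here, and `fake_judge` stands in for whatever model client you actually use.

```python
import json
import re

SCORE_KEYS = ("faithfulness", "relevance", "completeness")


def parse_judge_reply(raw: str) -> dict:
    """Extract and validate the JSON object from a judge model's reply."""
    # Judges often surround the JSON with prose, so take the outermost {...} span.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge reply")
    data = json.loads(match.group(0))
    scores = {}
    for key in SCORE_KEYS:
        value = data.get(key)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{key} is missing or not a number")
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{key}={value} is outside [0, 1]")
        scores[key] = float(value)
    scores["reason"] = str(data.get("reason", ""))
    return scores


def grade(prompt: str, call_judge) -> dict:
    """Send an already filled-in judge prompt through `call_judge` and parse the reply."""
    return parse_judge_reply(call_judge(prompt))


def fake_judge(prompt: str) -> str:
    """Stand-in for a real model call; returns a canned reply to show the flow."""
    return (
        "Here is my grade:\n"
        '{"faithfulness": 0.5, "relevance": 1, "completeness": 0.75, '
        '"reason": "One claim is unsupported."}'
    )


result = grade("<filled-in judge prompt>", fake_judge)
print(result)
```

Raising on malformed replies, instead of defaulting to a score, keeps parse failures visible in the eval run so they are not averaged into the metrics.
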
RAGAS: the RAG evaluation framework

RAGAS (Retrieval-Augmented Generation Assessment) provides four metrics that together cover the full RAG pipeline. Run it on a golden test set of 100–200 labelled Q&A pairs before every pipeline change.

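| Metric | What it measures | Formula (simplified) |
| --- | --- | --- |
| Faithfulness | Are the answer's claims supported by the context? | Supported claims / Total claims |
| Answer Relevancy | Does the answer address the question? | Mean embedding similarity between the question and questions generated back from the answer |
| Context Precision | Is the retrieved context actually relevant? | Relevant chunks / Retrieved chunks (RAGAS also weights by rank) |
| Context Recall | Is all the needed information in the retrieved context? | Ground-truth statements covered by the context / Total ground-truth statements |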
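
The ratio-style formulas in the table reduce to counting verdicts. The sketch below assumes a judge (LLM or human) has already labelled each claim, chunk, and ground-truth statement as True or False. The verdict lists are invented, and this is the simplified ratio from the table, not the RAGAS library's own implementation.

```python
def ratio(verdicts: list[bool]) -> float:
    """Share of True verdicts. Returning 0.0 for an empty list is a choice made here."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


# Hypothetical per-item verdicts for a single question.
claim_supported = [True, True, False, True]          # one answer claim lacks support
chunk_relevant = [True, False, True, False, False]   # two of five chunks are useful
truth_covered = [True, True, False]                  # one ground-truth statement is missing

faithfulness = ratio(claim_supported)
context_precision = ratio(chunk_relevant)
context_recall = ratio(truth_covered)

print(f"faithfulness={faithfulness:.2f}")
print(f"context_precision={context_precision:.2f}")
print(f"context_recall={context_recall:.2f}")
```

Keeping the per-item verdicts, and not only the final ratio, makes it possible to inspect exactly which claim or chunk dragged a score down.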

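Context Recall requires ground-truth labels, which makes it expensive to set up. In practice, start with Faithfulness and Context Precision, which catch the most common failure modes. Faithfulness needs no ground truth at all, and Context Precision can also run without it when you use a reference-free variant that judges each chunk against the question.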
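
Running the golden set before every pipeline change amounts to a regression gate: compare the candidate pipeline's mean scores with the current baseline and block the change if any metric drops too far. The sketch below is one way to do that. `find_regressions`, the 0.02 tolerance, and the score lists are all assumptions chosen for illustration.

```python
from statistics import mean


def find_regressions(baseline: dict, candidate: dict, tolerance: float = 0.02) -> dict:
    """Return each metric whose mean score dropped by more than `tolerance`."""
    regressions = {}
    for metric, base_scores in baseline.items():
        drop = mean(base_scores) - mean(candidate[metric])
        if drop > tolerance:
            regressions[metric] = round(drop, 4)
    return regressions


# Invented per-example scores on the same golden set, before and after a change.
baseline = {
    "faithfulness": [0.9, 0.8, 1.0, 0.9],
    "context_precision": [0.6, 0.7, 0.8, 0.7],
}
candidate = {
    "faithfulness": [0.7, 0.8, 0.9, 0.8],
    "context_precision": [0.7, 0.7, 0.8, 0.7],
}

regressions = find_regressions(baseline, candidate)
if regressions:
    print("Blocking change, regressions found:", regressions)
else:
    print("No regressions beyond tolerance.")
```

With only 100–200 examples, small differences in the mean can be noise, so the tolerance should be set with the run-to-run variance of your judge in mind.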
