How to Evaluate LLM Systems: RAGAS, G-Eval, and Custom Grading
What groundedness, faithfulness, and citation accuracy actually measure — and how to build an eval pipeline that catches real failures before users do.
**Readable at any point in the path.** After this post you'll know how to measure whether your AI system is actually working: human evals, LLM-as-judge, automated metrics — and the conditions under which each one lies to you.
You can't eyeball your way to production-ready LLM systems. Human review doesn't scale. You need automated evaluation pipelines that catch regressions before users do. This is the hardest part of LLMOps — and the most skipped.
Why LLM eval is hard
- No single correct answer: unlike classification, there are many valid responses to most prompts
- Ground truth is expensive: labelling 1,000 examples costs real time and money
- Metrics don't align with quality: BLEU and ROUGE are cheap but only measure n-gram overlap with a reference, so a correct paraphrase can score low while a fluent wrong answer scores high
- Distributional shift: your eval set goes stale as user behaviour evolves
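To make the overlap-metric problem concrete, here is a minimal sketch of a ROUGE-1-style unigram F1. It is a simplified stand-in written for this post, not the official ROUGE implementation, and the example sentences are made up.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """ROUGE-1-style score: harmonic mean of unigram precision and recall."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference answer and two candidate responses.
reference = "the refund window is 30 days from delivery"
wrong_but_similar = "the refund window is 90 days from delivery"
right_but_reworded = "you can return it within a month of receiving it"

print(unigram_f1(wrong_but_similar, reference))   # 0.875
print(unigram_f1(right_but_reworded, reference))  # 0.0
```

The factually wrong answer scores 0.875 because it copies the reference's wording, while the correct paraphrase scores 0.0 because it shares no tokens with it.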
LLM-as-judge: the practical standard
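Use a strong model (GPT-4o, Claude Opus) to grade the outputs of your weaker production model. The judge scores on dimensions like correctness, faithfulness, relevance, and tone. This scales far beyond what human review can cover, and strong judges track human judgement closely: Zheng et al. (2023) report GPT-4 agreeing with human preferences more than 80% of the time on MT-Bench, roughly the rate at which humans agree with each other. The same paper documents where judges mislead: position bias, a preference for longer answers, and a preference for the judge's own outputs. Spot-check judge scores against a small human-labelled sample before trusting them.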
# Braces in the JSON example are doubled so that JUDGE_PROMPT.format(...)
# treats them as literal text rather than as placeholders.
JUDGE_PROMPT = """
You are grading an AI assistant's response.
Question: {question}
Retrieved context: {context}
Assistant response: {response}
Grade on:
1. Faithfulness (0-1): Is every claim in the response supported by the context?
2. Relevance (0-1): Does the response actually answer the question?
3. Completeness (0-1): Does the response cover all key points from the context?
Respond in JSON: {{"faithfulness": 0.X, "relevance": 0.X, "completeness": 0.X, "reason": "..."}}
"""
RAGAS: the RAG evaluation framework
RAGAS (Retrieval-Augmented Generation Assessment) provides four metrics that together cover the full RAG pipeline. Run it on a golden test set of 100–200 labelled Q&A pairs before every pipeline change.
| Metric | What it measures | Formula |
|---|---|---|
| Faithfulness | Are answer claims supported by context? | Supported claims / Total claims |
| Answer Relevancy | Does the answer address the question? | Mean cosine sim(question, questions generated from the answer) |
| Context Precision | Is retrieved context actually relevant? | Relevant chunks / Retrieved chunks |
| Context Recall | Is all needed info in the retrieved context? | Covered ground truth / Total ground truth |
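The ratio-style formulas in the table reduce to simple arithmetic once you have per-item verdicts. The sketch below assumes those verdicts already exist (in practice an LLM judge produces them) and uses made-up values. It illustrates the table's formulas only; the RAGAS library's own implementation may differ in detail, for example by weighting context precision by rank.

```python
def ratio(verdicts: list[bool]) -> float:
    """Fraction of True verdicts; 0.0 when there is nothing to score."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


# Hypothetical judge verdicts for a single question.
claim_supported = [True, True, False, True]         # one per claim in the answer
chunk_relevant = [True, False, True, False, False]  # one per retrieved chunk
truth_covered = [True, True, True]                  # one per ground-truth statement

faithfulness = ratio(claim_supported)      # supported claims / total claims
context_precision = ratio(chunk_relevant)  # relevant chunks / retrieved chunks
context_recall = ratio(truth_covered)      # covered ground truth / total ground truth

print(faithfulness, context_precision, context_recall)  # 0.75 0.4 1.0
```

Read together, these numbers localise the failure: here retrieval found everything needed (recall 1.0) but pulled in a lot of noise (precision 0.4), and the generator still made one unsupported claim.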
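Context Recall requires ground-truth answers, which are expensive to label. In practice, start with Faithfulness and Context Precision: they catch the most common failure modes (unsupported claims and irrelevant retrieval). Faithfulness needs no ground truth. Context Precision can also be scored without it, though depending on the RAGAS version its default form may expect a reference answer.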
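Running the metrics on a golden set before every pipeline change is only useful if something fails when scores drop. Here is a minimal sketch of such a regression gate. The function name, the 0.02 tolerance, and the per-example scores are all illustrative assumptions, not values from RAGAS.

```python
from statistics import mean


def find_regressions(
    baseline: dict[str, list[float]],
    candidate: dict[str, list[float]],
    tolerance: float = 0.02,
) -> dict[str, float]:
    """Return each metric whose mean score dropped by more than `tolerance`."""
    regressions = {}
    for metric, base_scores in baseline.items():
        drop = mean(base_scores) - mean(candidate[metric])
        if drop > tolerance:
            regressions[metric] = round(drop, 4)
    return regressions


# Hypothetical per-example scores on the same golden set, before and after a change.
baseline = {
    "faithfulness": [0.9, 1.0, 0.8, 0.9],
    "context_precision": [0.6, 0.8, 0.7, 0.7],
}
candidate = {
    "faithfulness": [0.7, 0.9, 0.8, 0.8],
    "context_precision": [0.7, 0.8, 0.7, 0.7],
}

failed = find_regressions(baseline, candidate)
print(failed)  # only faithfulness regressed, by about 0.1
```

In CI you would exit non-zero when `failed` is non-empty. With only a few hundred examples, small differences in the mean are noisy, so pick the tolerance with that in mind.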
References
- RAGAS: Automated Evaluation of Retrieval Augmented Generation (Es et al., 2023)
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment (Liu et al., 2023)
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (Zheng et al., 2023)
- LM Evaluation Harness (EleutherAI)