September 29, 2025
By Hubert Brychczynski
Retrieval Augmented Generation,
Artificial Intelligence
If you’ve ever used NotebookLM from Google, you’ve experienced retrieval-augmented generation (RAG).
NotebookLM is powered by Google’s large language model, Gemini. Its primary purpose is to help users obtain reliable answers about the data they have uploaded to it.
Once someone uploads their documents to a NotebookLM instance (simply called a “notebook”), the service indexes the attached information and uses it as a reference when generating answers to questions.
This approach is the core principle behind RAG: it allows the system to minimize, though not entirely eliminate, hallucinations about the source material it has been given.
RAG systems are used across different industries as interactive knowledge bases, specialized chatbots, and more. For example, a clinic might want to use a RAG-based system to make it easier for the staff to get information about patients. Instead of browsing through files to check someone’s age, a physician might simply open the chatbot and ask: “How old is Jacob?” This ostensibly straightforward question that Jacob himself would answer in a heartbeat relies on a very complicated process of encoding, embedding, and retrieval on one hand, and continuous, multi-faceted testing on the other.
For a basic primer about the first part of the equation – encoding, embedding, and retrieval – refer to our earlier article on building SQL-to-RAG pipelines.
Here, we’ll focus on the other half: evaluation.
We’ve already covered evaluation in classification models.
Both classification models and RAG systems are machine learning technologies that we need to test for reliability. RAG, however, is a completely different beast: it can do the same tasks as classification models and more, outclassing them in range and complexity. This necessitates a distinct approach to testing.
Read on to discover:
Why is testing machine learning models crucial, regardless of their complexity?
Why use different types of scoring for the same evaluation?
What are the best practices for RAG system evaluation?
What are the challenges, variables, metrics, and steps in evaluating RAG systems?
What is the difference between LLM judges and human evaluators?
We test machine learning models to make sure they don’t just memorize the training data but can actually extrapolate the patterns they learned to new, unseen examples. We call this property generalization.
Without the ability to generalize, a model is like a parrot tied to a textbook. Give it a new input, even one that’s just a little different, and it will often fail – sometimes dramatically. For example, imagine a spam filter that only blocks word-for-word copies of old spam emails and lets new phishing variations right through.
Word-for-word spam blocking is a good example of overfitting. This phenomenon occurs when a model latches so closely onto its training data that it can’t adapt to anything new. Underfitting is the opposite: in this case, the model isn’t effective even on the kind of data it has been trained on.
Everything in model evaluation is designed to prevent these errors from happening and deliver a reliable system that can adapt to a variety of real-world scenarios.
Our previous guide discussed evaluation in the context of classification models. Today, we turn our attention to RAG, or retrieval augmented generation.
Retrieval-augmented generation can be thought of as a technique that narrows the focus of large language models (LLMs), such as those used in chatbots, question-answering systems, and summarization tools.
RAG supplements an LLM’s ability to generate text by adding a retrieval step: when a user submits a query, the system does not rely exclusively on information stored in its internal parameters (i.e., what it learned during training). Instead, RAG actively searches external data sources (such as databases, APIs, document repositories, or internal knowledge bases) for the most relevant, up-to-date information. It then augments the LLM’s prompt by adding this retrieved data as additional context before generating a final answer.
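To make that flow concrete, here is a minimal sketch of the retrieve-then-augment-then-generate loop. The in-memory document list, the keyword-overlap retriever, and the llm() placeholder are our own stand-ins for illustration, not any particular framework’s API.

```python
# Minimal sketch of the retrieve-then-generate flow, assuming an in-memory
# document store, naive keyword-overlap retrieval, and a placeholder llm() call.

DOCUMENTS = [
    "Jacob Smith, born 1994, registered as a patient since 2015.",
    "Jacob Smith: latest bloodwork taken on 2025-03-02, results within norms.",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    """Rank documents by shared query terms (a stand-in for vector search)."""
    terms = set(query.lower().split())
    ranked = sorted(
        DOCUMENTS,
        key=lambda doc: len(terms & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def llm(prompt: str) -> str:
    """Placeholder for a call to any large language model."""
    return f"(model answer generated from a {len(prompt)}-character prompt)"

def answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return llm(prompt)

print(answer("How old is Jacob?"))
```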
In addition, we can set up RAG to route queries to external sources based on user intent – for example, if a physician asks about Jacob’s age, the system might automatically connect to an SQL database with patients’ personal information; and if the question concerns Jacob’s recent bloodwork, the query might get directed to a lab’s database.
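Routing could be sketched along these lines; the intents, route names, and lookup functions below are purely illustrative stand-ins for real database queries.

```python
# Hypothetical intent-based routing; the intent rules and lookup functions
# are illustrative, not part of any real clinic system.

def query_patients_db(question: str) -> str:
    return "Jacob is 31 years old (patients table)."             # stand-in for SQL lookup

def query_lab_db(question: str) -> str:
    return "Jacob's latest bloodwork is within norms (lab DB)."  # stand-in for lab lookup

ROUTES = {
    "demographics": query_patients_db,
    "labs": query_lab_db,
}

def classify_intent(question: str) -> str:
    """Toy keyword rule; a production system might ask an LLM to classify intent."""
    return "labs" if "bloodwork" in question.lower() else "demographics"

def route(question: str) -> str:
    return ROUTES[classify_intent(question)](question)

print(route("How old is Jacob?"))
print(route("What did Jacob's recent bloodwork show?"))
```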
As we can see, retrieval-augmented generation is considerably more complex than classification. To begin with, RAG functions as an extension of an already intricate system – a large language model. On top of that, RAG itself relies on a series of sophisticated processes, such as encoding, vector similarity search, and retrieval (more on that in one of our previous articles).
No wonder, then, that evaluation of RAG systems presents far more challenges than that of classification models.
Unlike classification, where every input gets a set label, RAG systems produce full, often nuanced, text responses. Evaluation is therefore not a simple question of comparing labels: RAG output varies in length, structure, style, and many other crucial dimensions, so it’s rarely a matter of plain right or wrong.
For example, think back to the physician asking the RAG-based system: “How old is Jacob?” and the system replying: “Jacob is in the 32nd year of his life.” Even if not technically incorrect, that awkward phrasing will likely confuse the doctor. After all, “being in the 32nd year of one’s life” actually means “being thirty-one years old”.
Evaluation has to account for this, which is why we can’t test the output for correctness only. We also need to assess how factually close it is, if and how it used retrieved information, whether it stuck to the user’s question, and more.
All in all, evaluating RAG systems requires accounting for at least six metrics: correctness, answer relevance, groundedness, context relevance, style, and latency. These sample metrics are illustrated in Figure 1.
Fig. 1: Metrics used in evaluating RAG systems.
The six metrics used in RAG evaluation are determined based on four variables: context, query, response, and reference answer.
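In code, one way to keep these four variables together is a simple record per test question; the field names below are our own choice rather than a standard schema.

```python
# One record per test question, holding the four variables the metrics
# are computed from; the field names are illustrative.
from dataclasses import dataclass

@dataclass
class EvalRecord:
    query: str                 # the user's question
    context: list[str]         # documents retrieved for that question
    response: str              # what the RAG system answered
    reference_answer: str      # the expected ("flashcard") answer

record = EvalRecord(
    query="How old is Jacob?",
    context=["Jacob Smith, born 1994, registered as a patient since 2015."],
    response="Jacob is in the 32nd year of his life.",
    reference_answer="31",
)
```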
Let’s see how these variables translate into the various metrics in RAG system evaluation.
When testing for correctness, we’re asking: does the generated response align with the reference answer even if the phrasing is different? For example, “the 32nd year of his life” is correct if the reference answer to “How old is Jacob?” is “31.”
Answer relevance looks at how well the response actually addresses the user’s question (i.e. the query). Is the information on topic, or does it wander? For “How old is Jacob?”, a reply of low relevance could be “Jacob is a young adult in his early thirties”.
Then we have groundedness, which checks if the model used the context information or if it hallucinated something different and possibly irrelevant. For our question about Jacob’s age, we’re checking what the supporting documents (i.e. context) say and whether it aligns with the model’s answer.
Alongside that, context relevance measures whether the model retrieved the right context for the query. For example, did the model pull the document that mentions Jacob’s birth year, or did it also look at information about his siblings, residence, medical history, etc., which would be irrelevant to the query and might contaminate the reply?
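To make this concrete, here is one way the four metrics above (correctness, answer relevance, groundedness, and context relevance) could be phrased as LLM-judge prompts. The wording is our own and would need tuning and testing before real use.

```python
# One possible phrasing of each metric as an LLM-judge prompt; the wording
# is illustrative and would be refined for a production judge.

JUDGE_PROMPTS = {
    "correctness": (
        "Does the response state the same fact as the reference answer, "
        "even if phrased differently?\nReference: {reference}\nResponse: {response}"
    ),
    "answer_relevance": (
        "Does the response directly address the question, without wandering "
        "off topic?\nQuestion: {query}\nResponse: {response}"
    ),
    "groundedness": (
        "Is every claim in the response supported by the retrieved context?\n"
        "Context: {context}\nResponse: {response}"
    ),
    "context_relevance": (
        "Is the retrieved context actually relevant to answering the question?\n"
        "Question: {query}\nContext: {context}"
    ),
}

# Example: filling in the correctness prompt for the Jacob question.
prompt = JUDGE_PROMPTS["correctness"].format(
    reference="31",
    response="Jacob is in the 32nd year of his life.",
)
```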
Finally, two other important factors include latency and style.
Latency indicates the speed with which the system replies. Style refers to the answer’s wording, tone, and conciseness.
Why are they important? Well, imagine you’re the doctor asking the system about Jacob’s age. The system is built to answer questions specifically about Jacob (and other patients), yet for some reason you wait over a minute for the reply. In that time, you could calculate the age yourself or even call Jacob and ask him directly.
Similarly, suppose the answer is prompt (pun intended) but so convoluted that it takes a while to decode or is simply incomprehensible. For example, it wouldn’t be of much help to know that Jacob is roughly 1,600 weeks old. Our initial answer, “Jacob is in the 32nd year of his life”, also falls under this category because it is awkwardly phrased – despite passing all the other tests from correctness to context relevance.
Although funny, these challenges are far from theoretical. In fact, age is a perfect example of what stops large language models in their tracks: numbers, especially embedded in structured data. Generative AI models famously struggle with everything that isn’t phrased in natural language, which is why they might lag on questions that require calculations or interacting with databases such as SQL.
Facing a similar issue? Our article on building efficient SQL-to-RAG pipelines describes a solution to this common predicament.
Now that we know the metrics and variables, we can create a testing environment and proceed to evaluation.
Figure 2 illustrates the basic process that guides RAG system evaluation from development to release and beyond. First, we build a test suite. We run the tests, evaluate the responses, aggregate and analyze the results, tweak the tests or the system, and repeat the process as necessary.
Fig. 2: RAG evaluation test cycle.
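A single pass through that cycle might look like the bare-bones sketch below, where system and score stand in for the RAG system under test and whatever scoring step we use.

```python
# Bare-bones version of the Fig. 2 cycle; `system` and `score` are
# placeholders for the RAG system under test and an LLM/human scoring step.

def run_cycle(system, test_suite, score) -> float:
    results = []
    for case in test_suite:                     # run the tests
        response = system(case["query"])
        results.append(score(case, response))   # evaluate each response
    return sum(results) / len(results)          # aggregate; analyze and tweak outside
```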
Let's take a look at how this plays out over the development lifecycle.
Testing a RAG system is a bit like testing a human student on their knowledge with a set of flashcards. The “flashcards” in this analogy represent a set of real questions and answers (the “reference answers” in Fig. 1 above) that the model should be able to provide on its own.
These test questions are mostly generated by the LLM with prompt templates but always reflect true user scenarios. The idea is to cover a broad range of topics to avoid cherry-picking easy cases.
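For illustration, a clinic chatbot might generate its suite from templates like these; in practice, an LLM fills in and varies the questions, and the reference answers come from ground-truth records.

```python
# Illustrative question templates for a hypothetical clinic chatbot.

QUESTION_TEMPLATES = [
    "How old is {patient}?",
    "When was {patient}'s last blood test?",
    "Which medications is {patient} currently taking?",
]

def build_test_suite(patients: list[str]) -> list[dict]:
    return [
        {"query": t.format(patient=name), "reference_answer": None}  # filled in later
        for name in patients
        for t in QUESTION_TEMPLATES
    ]

suite = build_test_suite(["Jacob", "Maria"])
print(len(suite), "test questions")  # 6
```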
When the test suite is ready, we feed the questions to the system and capture its responses. LLM judges and/or human evaluators score the replies across the evaluation metrics, from correctness to latency and style.
The key with RAG is to tailor the strictness of evaluation to the task. The best judge setups include instructions that account for valid paraphrasing, numerical rounding, and other acceptable deviations. Otherwise, we’d get a lot of false negatives: “Jacob is thirty-one” could be flagged as wrong if the judge expected to see the number 31.
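One simple way to reduce such false negatives, besides writing lenient judge instructions, is to normalize obvious numeric paraphrases before comparison. The tiny word list below is only meant to illustrate the idea.

```python
# Normalizing numeric paraphrases so "thirty-one" and "31" compare as equal;
# a deliberately minimal sketch, not a general-purpose parser.

WORDS_TO_NUMBERS = {"thirty": 30, "thirty-one": 31, "thirty-two": 32}

def extract_age(text: str) -> int | None:
    for token in text.lower().replace(".", "").split():
        if token.isdigit():
            return int(token)
        if token in WORDS_TO_NUMBERS:
            return WORDS_TO_NUMBERS[token]
    return None

assert extract_age("Jacob is thirty-one") == extract_age("31")
```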
We then aggregate the results. If certain types of questions are consistently getting bad scores, that points to a weakness. Maybe the retrieval needs tuning, prompt templates should be rewritten, or the LLM post-processing is at fault.
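A minimal aggregation step could group scores by question category to surface weak spots; the categories and the 0.6 threshold below are made up for the example.

```python
# Grouping scores by question category to spot systematic weaknesses.
from collections import defaultdict

def aggregate(results: list[dict]) -> dict[str, float]:
    buckets = defaultdict(list)
    for r in results:                        # r = {"category": ..., "score": ...}
        buckets[r["category"]].append(r["score"])
    return {cat: sum(scores) / len(scores) for cat, scores in buckets.items()}

averages = aggregate([
    {"category": "age", "score": 0.9},
    {"category": "medication", "score": 0.4},
    {"category": "medication", "score": 0.5},
])
weak_spots = [cat for cat, avg in averages.items() if avg < 0.6]
print(weak_spots)  # ['medication']
```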
We experiment by tweaking one aspect at a time to isolate a variable and run the test set again. The process continues as we add new templates and discover edge cases until the overall score is solid enough to push the system to production.
But the story doesn’t end there.
LLM-based systems like RAG are non-deterministic, meaning their behavior can’t be fully predicted – even if the system itself doesn’t change after deployment.
But it does change. RAG-based systems rely on a connected knowledge base, which can change (Jacob moves houses) or expand (he registers his kids at the clinic).
In addition, the underlying LLM can also get upgraded over time, potentially affecting the RAG system’s performance.
When the RAG system gets a new version, we benchmark it against the old one, sometimes running two test suites in parallel for A/B testing performance.
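A rough sketch of such a head-to-head benchmark, assuming both versions expose the same query interface and share one scoring function; a real setup would also check whether the difference is statistically significant.

```python
# Benchmarking a new RAG version against the old one on the same test suite.

def benchmark(old_system, new_system, test_suite, score) -> dict[str, float]:
    old_scores, new_scores = [], []
    for case in test_suite:
        old_scores.append(score(case, old_system(case["query"])))
        new_scores.append(score(case, new_system(case["query"])))
    return {
        "old_avg": sum(old_scores) / len(old_scores),
        "new_avg": sum(new_scores) / len(new_scores),
    }
```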
During regular operation, we watch the system closely, tune prompts, fix the data, and refine the LLM judge’s instructions.
Ideally, though, the testing suite should become largely automated after enough cycles.
The goal is to reach a point where we trust the system to flag real issues without having to constantly intervene. Instead, we routinely check numerical results until there’s a new pattern or a significant shift in how users interact with the system.
That said, vigilance is essential. Production logs sometimes reveal new vulnerability patterns, such as particular kinds of questions failing. When that happens, we design new scenarios, add more question templates, update the test set, or even change how the underlying data is formatted or retrieved – until the overall score goes up again.
Most of the time, especially at scale, an LLM itself evaluates the RAG output. We create prompt templates with specific evaluation instructions, including placeholders for the user’s query, the system’s reply, and, if available, the reference answer.
We learned early on that LLM judges need detailed instructions. For example, when the expected answer is a number and the system returns the correct information expressed in a different way – say, as an ordinal or a rounded value – the judge might flag it as wrong until we adjust the evaluation prompt.
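Here is an example of what such an evaluation prompt might look like, with placeholders for the query, response, and reference answer. The wording and rules are our own, not a standard or vendor-provided template.

```python
# Example judge prompt with placeholders and explicit equivalence rules.

JUDGE_TEMPLATE = """You are grading a RAG system's answer.

Question: {query}
System answer: {response}
Reference answer: {reference}

Rules:
- Accept paraphrases that convey the same fact as the reference.
- Accept numbers written as words, ordinals, or reasonably rounded values
  (e.g. "31", "thirty-one", "in the 32nd year of his life").

Reply with CORRECT or INCORRECT and one sentence of justification."""

prompt = JUDGE_TEMPLATE.format(
    query="How old is Jacob?",
    response="Jacob is in the 32nd year of his life.",
    reference="31",
)
```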
LLMs aren’t perfect. Sometimes, they’re too rigid; other times, they get biased towards their own answers. They also tend to miss subtleties in tone or style that only a person can catch.
For aspects like ambiguity or subjective user experience, we bring in human reviewers – ideally already trained in the type of tasks they would be annotating. These reviewers might go through a sample of responses, especially on new types of queries or after big changes, to check for issues the automated process misses. Sometimes, to introduce even more nuance, we have several annotators review the same outputs to surface discrepancies in judgement and point to especially difficult cases.
Depending on the kind of question and the metric we’re testing for, we might prompt the judge to classify the output as “good/bad” or score it on a specific scale (e.g. from 0 to 1).
Oftentimes, however, we do both. This double-track approach allows us to discover false negatives by analyzing discrepancies between judges. For example, the answer “Jacob is in the 32nd year of his life” might get a “good” rating from one judge but a “0.5” from another, pointing to a potential problem with the output. If we relied on only one judge, the problem could slip through the cracks.
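A simple way to implement this double-track check is to flag every case where the binary verdict and the scaled score disagree; the 0.8 threshold below is arbitrary and chosen only for illustration.

```python
# Flagging disagreements between a binary verdict and a 0-1 score for review.

def flag_discrepancies(judgements: list[dict]) -> list[dict]:
    flagged = []
    for j in judgements:                     # j = {"id": ..., "binary": ..., "scaled": ...}
        binary_ok = j["binary"] == "good"
        scaled_ok = j["scaled"] >= 0.8
        if binary_ok != scaled_ok:
            flagged.append(j)                # judges disagree -> send for review
    return flagged

print(flag_discrepancies([
    {"id": 1, "binary": "good", "scaled": 0.5},   # disagreement, gets flagged
    {"id": 2, "binary": "good", "scaled": 0.95},  # agreement, passes
]))
```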
The cornerstone of RAG evaluation is developing a realistic, diverse test suite with a mix of simple and complex scenarios.
The test suite should run whenever the data, prompts, or code change, not only at the end of every ingestion pipeline cycle.
Ultimately, the goal is to automate the evaluation and only make changes as new user behaviors, feedback votes, data drifts, or edge cases emerge in production.
Finally, compare the test results not only to yesterday’s performance but also to other, established systems. If you’re building a domain-specific chatbot and it can’t match the relevance of a general-purpose tool like ChatGPT, that’s a sign you need to go back and rework the fundamentals.
RAG is a method that combines large language models with external data retrieval. Instead of relying only on what a model learned during training, it searches connected sources, like databases or document repositories, for relevant information and uses that context to generate more accurate answers.
RAG is applied across industries wherever reliable, context-specific answers are needed. Common examples include interactive knowledge bases, specialized chatbots, and enterprise systems.
Testing RAG involves more than checking if answers are “right” or “wrong.” Evaluations look at metrics like correctness, relevance, groundedness, style, and latency. Teams build automated test suites, score responses with both LLM and human judges, and refine the system continuously both in development and production.
Ready to discuss your software engineering needs with our team of experts?