
Testing Classification Models in Machine Learning: A Lead-in to RAG Evaluation

August 12, 2025

By Hubert Brychczynski

  • Machine Learning
  • Classification
  • Testing
  • RAG


When was the last time someone offered you a million dollars for clicking a link in their message? If you don’t remember, your spam filter is doing its job well.

Spam filters work because we trained them on thousands of emails. And did we just hope for the best afterwards?

Of course not. Spam filters, medical‑imaging assistants, recommendation engines, LLMs—all machine‑learning models must be tested to ensure their efficacy and reliability.

This is the first in a three-part series on ML evaluation, setting the stage for a deep dive into retrieval-augmented generation (RAG) testing.

What is RAG?

“RAG” stands for retrieval-augmented generation. Broadly speaking, a RAG system retrieves relevant information from an external source and supplies it to a large language model as additional context, allowing the model to respond more precisely to context-specific prompts. RAG systems are used in chatbots, recommendation engines, and text-summarization solutions.

Before we get into RAG evaluation, however, we’ll kick the series off with an overview of testing classification models.

What are classification models in machine learning?

As the name suggests, a classification model assigns each data point (e.g., an email) one or more labels from a set of predefined classes, based on patterns it learned in training.

Classification is a subset of supervised machine learning with applications not only in spam filters but also in diagnostics, document processing, image recognition, and weather or traffic pattern prediction.
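
As a minimal sketch of what that looks like in practice (assuming scikit-learn; the tiny inline dataset and the bag-of-words plus Naive Bayes setup are illustrative choices, not recommendations), here is a toy spam classifier:

```python
# A minimal, illustrative text classifier: learn patterns from labeled emails,
# then assign one of the predefined classes ("spam" / "ham") to new emails.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = [
    "You won a million dollars, click here",
    "Claim your inheritance now",
    "Meeting moved to 3 pm",
    "Here are the slides from yesterday",
]
labels = ["spam", "spam", "ham", "ham"]  # the predefined classes

# Bag-of-words features feeding a Naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

print(model.predict(["Click here to claim your prize"]))  # likely ['spam']
```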

The reason we start with classification models is that they’re great for demonstrating the fundamental principles behind machine learning evaluation.

Once we get our footing—by getting a general idea of what ML testing is in the context of classification models—we’ll move on to more sophisticated testing methods involved in evaluating RAG systems. Finally, we’ll devote the last part to a hypothetical case study: designing a testing pipeline for a restaurant chatbot.

What Makes Machine Learning Models Reliable?

A machine‑learning model is considered useful if it can correctly generalize to inputs it has never seen before, assuming those inputs are drawn from the same distribution as the training data. The real world is rarely that clean, but that assumption is the baseline we evaluate against.

For instance, we wouldn’t blame a spam filter (at least not yet) for failing to block a message with a deepfake video of a friend attached. But we would be disappointed if it let through an email from a “king” promising us a hefty inheritance.

Why Test Machine Learning Models?

We test machine‑learning models to ensure their capacity for generalization. The goal is to arrive at a “Goldilocks zone,” where the model produces accurate predictions on novel, context‑relevant examples without being overfitted or underfitted.

  • Overfitting occurs when a model tailors itself to its training data so closely that it cannot extrapolate beyond it; it responds reliably only to inputs that closely resemble what it was trained on.
  • Underfitting occurs when the model hasn’t learned enough structure from its training data and lacks the pattern‑recognition capability to make reliable predictions, even on the training data itself (the sketch below makes the contrast concrete).
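
Here is a small illustrative sketch (assuming NumPy and scikit-learn; the polynomial degrees are arbitrary) that fits the same noisy data with a model that is too simple and one that is too flexible:

```python
# Underfitting vs. overfitting on noisy 1-D data (illustrative only).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 30)

for degree in (1, 15):  # degree 1 tends to underfit, degree 15 tends to overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    # Training R^2: poor for the underfit model, near-perfect for the overfit one.
    print(degree, model.score(X, y))
```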

Evaluation also lets us compare models or configurations: How does Model A compare with Model B? Does Setting X lead to better outcomes than Setting Y? And so on.

What Metrics Do We Use to Evaluate Classification Models in Machine Learning?

In the case of classification models, a basic way to gauge their performance is by using the confusion matrix (Figure 1).

testing-f1.png

Fig. 1: Confusion matrix

The matrix compares the expected output (real labels) with the model’s actual predictions (predicted labels) and sorts the results into four categories (a short code sketch follows the list):

  • True positive (TP): Both the real and predicted labels are positive.
  • False positive (FP): The real label is negative, but the prediction is positive.
  • False negative (FN): The real label is positive, but the prediction is negative.
  • True negative (TN): Both labels are negative.
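
As a quick sketch of how these counts come together in practice (assuming scikit-learn; the labels below are made up), you can tally them directly from real and predicted labels:

```python
# Counting TP, FP, FN, TN from real vs. predicted labels (illustrative labels).
from sklearn.metrics import confusion_matrix

real      = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = spam, 0 = not spam
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# With labels=[0, 1], the matrix is laid out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(real, predicted, labels=[0, 1]).ravel()
print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
```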

The confusion matrix can be used to evaluate performance with various metrics. For the purpose of this article, we’ll discuss four: accuracy, precision (P), recall (R), and F1 score.

Other metrics relevant to classification include ROC curves and their area under the curve (AUC), especially useful when you need to pick a decision threshold.
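
If you do need to pick a threshold, a sketch along these lines (again assuming scikit-learn, with made-up scores) shows how the ROC curve and AUC are typically computed from predicted probabilities:

```python
# ROC curve and AUC from real labels and predicted scores (illustrative values).
from sklearn.metrics import roc_auc_score, roc_curve

real   = [0, 0, 1, 1, 0, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]  # model's spam probabilities

fpr, tpr, thresholds = roc_curve(real, scores)
print("AUC:", roc_auc_score(real, scores))
print("Candidate thresholds:", thresholds)
```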

What Evaluation Criteria and Formulas Do We Use?

Suppose we take 100 emails and ask a spam filter to label them. We then compare the model’s predictions with our labels and fill the confusion matrix (Figure 2):

testing-f2.png

Fig. 2: Hypothetical confusion matrix for spam classification

From this we can calculate accuracy, precision, recall, and F1 score.
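
Since Figure 2’s exact counts aren’t reproduced in the text, the sketch below uses assumed counts purely to show how the four metrics fall out of the matrix:

```python
# Computing the four metrics from assumed confusion-matrix counts (illustrative).
tp, fp, fn, tn = 40, 10, 5, 45  # hypothetical counts for 100 emails

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)  # 0.85, 0.8, ~0.889, ~0.842
```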

Accuracy

Dividing true positives plus true negatives by the total number of predictions yields accuracy (Figure 3).

testing-f3.png

Fig. 3: Accuracy formula
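
For reference, the formula in Figure 3 can be written as:

\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
\]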

Accuracy is a general metric that shows overall correctness, but it can be misleading with imbalanced datasets.

For example, assume our batch of 100 emails contains only 5 spam messages, and the model labels every email as “not spam.” It catches zero spam, yet the 95 true negatives still translate into 95 percent accuracy. The score looks impressive, but the model is useless as a spam filter.

Precision (P)

Dividing true positives by all predicted positives gives precision (Figure 4).

testing-f4.png

Fig. 4: Precision formula
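
In symbols, the formula in Figure 4 is:

\[
P = \frac{TP}{TP + FP}
\]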

High precision means false positives are rare, so we won’t often find regular email in the spam folder.

Recall (R)

Recall is calculated as true positives divided by all real positives, which include false negatives (Figure 5).

testing-f5.png

Fig. 5: Recall formula
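
Written out, the formula in Figure 5 is:

\[
R = \frac{TP}{TP + FN}
\]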

Low recall means the model tends to mistake positives for negatives. While relatively harmless in spam filtering, low recall can have dire consequences in fields such as medical testing.

F1 Score

The F1 score combines precision and recall to balance the two (Figure 6).

testing-f6.png

Fig. 6: F1 score formula
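
The formula in Figure 6 is the harmonic mean of precision and recall:

\[
F_1 = 2 \cdot \frac{P \cdot R}{P + R}
\]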

A high F1 score indicates that both precision and recall are high, meaning neither false positives nor false negatives occur too often.

Although the F1 score seems the most comprehensive metric, sometimes precision matters more than recall (as in spam filtering) or vice versa (e.g., medical imaging). In such cases, it’s better to focus on the most relevant metric rather than the F1 score alone.

Which Errors Can You Afford Based on Use Case?

Not every use case can afford the same type of mistake. A false positive can be trivial when an airport scanner triggers a manual bag check, while a false negative in the same situation could end in disaster. On the other hand, losing a legitimate email to a false positive in a spam filter may sting more than letting a spam message slip through (a false negative).

Pick metrics that reflect your real costs: precision when false positives hurt more, recall when false negatives do.

For even finer control, you can also add weights to precision and recall to reflect your preference of one over the other. In any case, look at the individual metrics instead of just a single composite if you want a full picture of model performance.
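
One standard way to add such weights (not covered in detail here) is the F-beta score, where β > 1 favors recall and β < 1 favors precision:

\[
F_\beta = (1 + \beta^2) \cdot \frac{P \cdot R}{\beta^2 \cdot P + R}
\]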

How Do We Execute Evaluation?

Step 1: Splitting the Data

Before testing, we must separate training data from test data. Otherwise, the model would be evaluated on the very data it learned from, the results would look better than they really are, and overfitting would go undetected.

There are two common ways to split data: the Train/Validation/Test split and K‑Fold cross‑validation.

Train/Validation/Test Split

In this approach, we divide the data into three sets: a training set (about 60–80 percent), a validation set (10–20 percent), and a test set (10–20 percent). We use the training set to teach the model and the validation set to tune hyperparameters and compare candidate configurations. Only after we have chosen those settings do we use the test set to see how well the model works on data it has never seen. This ensures our final score reflects true performance on new examples.
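
Here’s a minimal sketch of such a split, assuming scikit-learn, a synthetic dataset, and a roughly 70/15/15 ratio (the exact percentages are a choice, not a rule):

```python
# Train/Validation/Test split via two calls to train_test_split (ratios assumed).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)  # synthetic stand-in data

# First split off 30% of the data, then halve it into validation and test sets.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, random_state=42, stratify=y_rest
)
# Result: ~70% train, ~15% validation, ~15% test
```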

K‑Fold Cross‑Validation

K‑Fold cross‑validation is more sophisticated. We randomly split the dataset into K equal folds (columns in Figure 7).

testing-f7.png

Fig. 7: K‑Fold cross‑validation

For instance, 100 emails could form five folds of 20 emails each.

We then perform K runs, training the model on K – 1 folds (four in our example) and testing it on the remaining one. We repeat this until every fold has served as the test set exactly once.

This approach ensures that the training and testing data are both equivalent (because they originate from the same source) and distinct (because they’re randomly separated), which gives a more reliable picture of the model’s capacity to generalize and makes overfitting easier to spot.

K‑Fold cross‑validation is especially useful when data are scarce or when creating custom test sets isn’t feasible.
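
Here’s a brief sketch of five-fold cross-validation, assuming scikit-learn, a synthetic stand-in dataset, and F1 as the scoring metric:

```python
# 5-fold cross-validation: each fold serves as the test set exactly once.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=100, random_state=0)  # stand-in for 100 emails
model = LogisticRegression(max_iter=1000)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=kfold, scoring="f1")
print(scores)         # one F1 score per fold
print(scores.mean())  # average across the 5 runs
```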

Evaluating Chatbots

Machine learning powers both spam filters and large language models, but LLMs perform much more nuanced tasks. We can think of a spam filter as a binary classifier—its job is to decide whether an email is spam (1) or not (0). The output of an LLM, by contrast, can rarely be evaluated so categorically.

Consequently, the metrics and methods for chatbot evaluation differ from those used for binary classifiers. We still need to separate training data from test data, so the Train/Validation/Test split and K‑Fold cross‑validation still apply, but running and evaluating the tests themselves is a different matter.

Come back next time to learn more about evaluating RAG‑based systems. In the meantime, read up on developing an SQL‑to‑RAG pipeline to enhance the speed and accuracy of chatbots that rely on databases!

Frequently Asked Questions

What is a confusion matrix?

A confusion matrix compares real labels to model predictions and counts true/false positives and negatives. It’s the basis for accuracy, precision, recall, and F1 in classification models.

When should I prioritize precision over recall?

Choose precision when false positives are costly (e.g., mislabeling legitimate email as spam). Choose recall when false negatives are riskier (e.g., medical screening). Align the metric with your real-world costs.

How does K-Fold cross-validation work?

K-Fold cross-validation splits data into K folds, trains on K−1 and tests on the remaining fold, rotating until each fold is tested once. It gives a more reliable performance estimate and helps reduce overfitting, especially with limited data.
