Date: 2026-03-10
Model evaluation refers to the process of measuring how well a machine learning model performs. This process asks the question: When your model makes a judgment call about the real world, how often is it right? Or, when the answer lies on a spectrum, how close to right was it?
As firms rely more heavily on AI models, real money is increasingly at stake. In February 2021, leaders at Zillow made a large bet based on their machine learning models that predicted the values of homes. Not only would Zillow make these estimates; it would often purchase the homes its model priced, through a related business called Zillow Offers.
Just eight months later, Zillow wound down Zillow Offers and took a USD 304 million inventory write-down. The cause, the company said, was buying many homes for prices above what it believed it could sell them for. The company’s stock dove, and Zillow laid off about 25% of its staff.
To blame? Its AI model wasn’t accurate enough to weather the market ahead. Its predictions and forecasts did not match the actual values of homes.1
As ML models spread into healthcare, hiring and criminal justice, poor evaluation can cause real harm to real people. Across data science and industry, getting model evaluation metrics right has become an important part of deploying AI responsibly.
Different models are meant to do different things.
Classification models label incoming data as belonging to one of a few categories. (A model that flags a patient as having sepsis, or not, is a classification model.)
Regression models instead output a number along a continuum. (Zillow’s home prices model was a regression model.)
The different model types require different kinds of testing. Often, triangulating performance through multiple metrics is ideal because no single metric is without its uncertainties.
Some models address “classification problems,” meaning they carve up the world into categories. Classification metrics are similarly blunt. Model accuracy is fairly intuitive: It takes the number of correct predictions and divides that by the total number. (In machine learning, the word “prediction” refers to the educated guesses models make—even if the guess is about something happening now, rather than in the future.)
The problem with model accuracy is that a high number can lull stakeholders into a false sense of security. A model meant to detect a rare but catastrophic event (say, a certain kind of cancer) might reflexively classify every scan as negative. It would receive high model accuracy, because 99.99% of those negative readings would be correct. But this high accuracy would be cold comfort to the poor patient who received the rare false negative. The model was accurate in a technical sense, but didn’t do its job.
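The accuracy trap can be reproduced in a few lines. This is a minimal sketch with made-up numbers: the dataset of 10,000 scans containing exactly one true case is an assumption chosen to match the 99.99% figure, not real data.

```python
# Hypothetical data: 10,000 scans, exactly one true cancer case (1 = cancer).
y_true = [1] + [0] * 9_999

# A "model" that reflexively labels every scan as negative.
y_pred = [0] * len(y_true)

# Accuracy = number of correct predictions / total number of predictions.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# How many of the actual cancer cases did the model catch?
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(f"accuracy: {accuracy:.2%}")       # accuracy: 99.99%
print(f"cancer cases caught: {caught}")  # cancer cases caught: 0
```

The accuracy looks excellent, yet the one patient who mattered was missed.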
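It becomes useful to carve up a classification model’s performance into the types of predictions, or educated guesses, that it makes. In a binary classification task, like cancer detection, there are four possible outcomes (when arrayed in a 2x2 grid, this framework is often called a “confusion matrix”):
True positive: The model predicts cancer, and the patient has cancer.
False positive: The model predicts cancer, but the patient does not have it.
True negative: The model predicts no cancer, and the patient does not have it.
False negative: The model predicts no cancer, but the patient has cancer.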
Already one can begin to see why it is worthwhile to break out these categories. A false positive cancer diagnosis would no doubt be traumatic, until further testing revealed the episode to be a medical scare. But a false negative reading can be fatal.
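As a rough illustration, the four outcomes (true positive, false positive, true negative, false negative) can be tallied directly from paired lists of true labels and predictions. The helper name `confusion_counts` and the eight sample labels are hypothetical, and the sketch assumes labels are encoded as 1 (positive) and 0 (negative).

```python
def confusion_counts(y_true, y_pred):
    """Count the four outcomes of a binary classifier (1 = positive, 0 = negative)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            counts["tp"] += 1   # flagged, and really there
        elif pred == 1 and truth == 0:
            counts["fp"] += 1   # flagged, but not there (false alarm)
        elif pred == 0 and truth == 0:
            counts["tn"] += 1   # not flagged, and not there
        else:
            counts["fn"] += 1   # not flagged, but really there (a miss)
    return counts

# Hypothetical labels for eight scans.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]
print(confusion_counts(y_true, y_pred))
# {'tp': 2, 'fp': 1, 'tn': 4, 'fn': 1}
```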
Data science practitioners have developed an array of submetrics to probe the performance of classifiers and assess the relationships among the quadrants of the confusion matrix.
The metric called precision asks, of all the positive predictions a classifier made, how many were correct?
A car-mounted image-recognizing algorithm passes 10 intersections on a test course, six of which have stop signs. Yet to say a model “caught all six stop signs” would be to elide key potential differences in precision. If it flagged all six accurately and did not produce false positives, then it had a precision of 6/6, or 100%. However, if it flagged those six but also hallucinated four stop signs that were not there, its precision was only 6/10, or a mere 60%.
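The arithmetic in this example can be checked with a small helper. The function below is an illustrative sketch, not a library API; returning 0.0 when the model flags nothing is an assumed convention.

```python
def precision(true_positives, false_positives):
    """Of everything the model flagged, what fraction was really there?"""
    flagged = true_positives + false_positives
    # Assumed convention: a model that flags nothing scores 0.0 instead of raising.
    return true_positives / flagged if flagged else 0.0

print(precision(6, 0))  # six real stop signs, no false alarms -> 1.0
print(precision(6, 4))  # six real stop signs, four hallucinated -> 0.6
```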
The metric called recall (also known as the “true positive rate”) measures something subtly different. Recall asks, of all the stop signs that were indeed there, how many did the model catch?
Imagine another test course with 100 intersections, 50 of which have stop signs. A model that catches 30 of these stop signs would have a recall of 60%; 40 of these, 80%; and so on. (Recall doesn’t concern itself with false alarms, so in theory one can “game” 100% recall by teaching a model to see stop signs everywhere.)
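The same figures fall out of a short sketch. The helper is hypothetical and, by assumption, returns 0.0 when there are no actual positives to find.

```python
def recall(true_positives, false_negatives):
    """Of everything that was really there, what fraction did the model catch?"""
    actual = true_positives + false_negatives
    # Assumed convention: with no actual positives, return 0.0 instead of raising.
    return true_positives / actual if actual else 0.0

print(recall(30, 20))  # caught 30 of 50 stop signs -> 0.6
print(recall(40, 10))  # caught 40 of 50 stop signs -> 0.8
print(recall(50, 0))   # flags every intersection: nothing missed -> 1.0
```

The last call shows the "gaming" case: a model that flags all 100 intersections misses nothing, so recall is perfect even though half of its flags are false alarms.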
These two metrics, precision and recall, exist in tension. An engineer seeking to improve recall might overshoot the mark, creating a model that too often gives false positives. Often, tuning a model amounts to managing tradeoffs between higher recall (catching all of the phenomenon you seek to detect) and lower precision (overshooting the mark and catching false positives as well).
In managing this tradeoff, machine learning practitioners often use a metric called an F1 score, which is a “harmonic mean” of precision and recall. (A harmonic mean differs from the more traditional average in that it is disproportionately affected by low values. An F1 score thus drops quickly if either precision or recall is low.)
A perfect F1 score would be 1.0, but there is no one-size-fits-all guidance for what counts as a sufficiently high F1 score; context matters greatly.2 What’s clear is that higher is better: the nearer the score is to 1.0, the better the model detects what it is meant to detect while minimizing false positives and false negatives.3
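To see how the harmonic mean punishes a weak score, the sketch below compares F1 with the ordinary average. The two precision and recall pairs are illustrative values, and treating the score as 0.0 when both inputs are zero is an assumed convention.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention to avoid dividing by zero
    return 2 * precision * recall / (precision + recall)

# Illustrative pairs: a mildly lopsided model and a badly lopsided one.
for p, r in [(0.6, 1.0), (0.9, 0.1)]:
    arithmetic = (p + r) / 2
    print(f"precision={p}, recall={r}: average={arithmetic:.2f}, F1={f1_score(p, r):.2f}")
```

In the second case the ordinary average is 0.50, but F1 falls to roughly 0.18, because the low recall drags the harmonic mean down.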
Within classification metrics, two metrics involve the related concepts of confidence and thresholds.
A model does not simply spit out “stop sign” or “not stop sign.” Rather, it says something like, “There is a 98% chance that this is a stop sign” (a highly confident prediction). Or it says, “There is a 51% chance this is a stop sign” (a not very confident prediction).
The metric known as log loss is designed to evaluate a model’s confidence. Highly confident mistakes receive a large penalty. Low confidence around correct predictions is also penalized, though to a lesser degree. A perfect model would score 0 on log loss, though that is rarely achieved. What constitutes a “good” score again depends on your model and task type.
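The asymmetry can be made concrete with the standard binary log loss formula. This is a minimal sketch: the function name, the clipping constant `eps`, and the three probabilities are illustrative assumptions.

```python
import math

def log_loss(y_true, probs, eps=1e-15):
    """Average binary log loss; probs are predicted probabilities of the positive class."""
    total = 0.0
    for truth, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # clip so that log(0) never occurs
        total += -(truth * math.log(p) + (1 - truth) * math.log(1 - p))
    return total / len(y_true)

# In all three cases the true label is 1 ("stop sign").
print(round(log_loss([1], [0.98]), 3))  # confident and right -> 0.02
print(round(log_loss([1], [0.51]), 3))  # hesitant but right  -> 0.673
print(round(log_loss([1], [0.02]), 3))  # confident and wrong -> 3.912
```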
Whatever a model’s confidence score, the human users of ML models must ultimately decide upon a threshold to turn a model’s hunches into final yes-or-no judgment calls. One threshold might institute the rule, “if >75% confident, then output ‘yes, a stop sign.’” But a human user might just as well choose a threshold of 51% confidence or 98% confidence instead. The resulting outputs from the model can of course vary greatly depending on whatever threshold is chosen.
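A small sketch shows how much the choice of threshold changes the output. The five confidence scores are hypothetical, and the strict `>` comparison mirrors the "if >75% confident" rule in the text.

```python
def apply_threshold(scores, threshold):
    """Turn confidence scores into yes/no calls: 1 if score > threshold, else 0."""
    return [1 if s > threshold else 0 for s in scores]

# Hypothetical "stop sign" confidence scores for five images.
scores = [0.98, 0.80, 0.60, 0.51, 0.30]

for threshold in (0.50, 0.75, 0.97):
    print(threshold, apply_threshold(scores, threshold))
# 0.5 [1, 1, 1, 1, 0]
# 0.75 [1, 1, 0, 0, 0]
# 0.97 [1, 0, 0, 0, 0]
```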
A ROC curve (after the technical phrase “receiver operating characteristic”) and the related metric ROC AUC (or “area under curve”) probe the model’s performance at many different thresholds. Technically, a ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) as the threshold varies. Conceptually, ROC AUC sets aside judgment calls at any particular cutoff, instead observing overall whether a model is good at sorting: “Regardless of where we set the threshold, is the model at least consistently outputting higher confidence scores when stop signs are indeed there?” ROC AUC summarizes this overall ability to separate positives from negatives.
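The sorting intuition can be computed directly. Rather than tracing the curve, the sketch below uses an equivalent formulation of ROC AUC: the share of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The function name and sample scores are hypothetical, and the double loop is meant for small illustrations only.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly scored above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Hypothetical scores: one real stop sign (0.4) is ranked below a non-sign (0.7).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
print(round(roc_auc(y_true, scores), 3))  # 8 of 9 pairs ranked correctly -> 0.889
```

A score of 1.0 would mean every positive outranks every negative; 0.5 is what random scoring would achieve on average.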
The preceding section treats “classification” problems, where a phenomenon (be it a stop sign or cancer) is straightforwardly present, or not. But many phenomena (home values, a patient’s glucose levels) occur on a spectrum, requiring different models and different performance measures. Models that address these phenomena output numbers rather than categories. They are called regression models and are evaluated with regression metrics, which ask in various ways, “How far off the mark is that number?”
Mean absolute error (MAE) asks, “On average, how far off were we?” If a model this week thinks a home will sell for USD 500,000 and it sells for USD 525,000, and then next week thinks a home will sell for USD 400,000 and it sells for USD 390,000, its MAE is USD 17,500 (25,000 + 10,000, divided by 2). MAE ignores whether a model is consistently over or under in its predictions. It just looks at the average distance from the truth.
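The two-home example can be reproduced as follows; the helper name is illustrative rather than taken from any particular library.

```python
def mean_absolute_error(actual, predicted):
    """Average distance between predictions and the truth, ignoring direction."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

predicted = [500_000, 400_000]  # what the model expected (USD)
actual = [525_000, 390_000]     # what the homes sold for (USD)

print(mean_absolute_error(actual, predicted))  # 17500.0
```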
Root mean squared error (RMSE) is similar, but it assigns a harsher penalty to numbers that were far off the mark. It achieves this by squaring errors, which makes big errors even larger, before taking the square root of the resulting average. The RMSE in the previous example is about USD 19,039, higher than the MAE of USD 17,500 because the USD 25,000 miss is weighted more heavily. (The related MSE, or mean squared error, works similarly but without the square root, which makes it less interpretable but sometimes mathematically useful.) RMSE is useful when large errors are especially costly.
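A short sketch confirms the figure, using the same two homes as the MAE example. The function name is illustrative.

```python
import math

def root_mean_squared_error(actual, predicted):
    """Square each error, average the squares, then take the square root."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)

predicted = [500_000, 400_000]  # what the model expected (USD)
actual = [525_000, 390_000]     # what the homes sold for (USD)

print(round(root_mean_squared_error(actual, predicted)))  # 19039
```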
A less intuitive metric is R-squared. R-squared measures not how far off a model’s predictions were, but rather how much of the target variable’s overall variation the model managed to explain.
To get a sense of R-squared, first imagine a simplistic home-price model that spits out the same value for every single home: the average price for the area. R-squared asks: How much better is our model than the pure average-guesser? The better the model captures the variance in actual prices, the higher its R-squared. (An R-squared of 0.85 means that the model explains about 85% of the variation in the outcome; an R-squared of 0 means it’s no better than the average-spewing model.)
Like all metrics, R-squared is imperfect. It is particularly sensitive to outliers: because it is built on squared errors, a handful of extreme values can dominate the score.
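The comparison against the average-guesser can be written out explicitly. In this sketch the four sale prices and both sets of predictions are made up, and the function assumes the actual values are not all identical (otherwise the denominator would be zero).

```python
def r_squared(actual, predicted):
    """1 minus (the model's squared error / the average-guesser's squared error)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # model's squared error
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # average-guesser's squared error
    return 1 - ss_res / ss_tot

# Hypothetical sale prices (USD) and two sets of predictions.
actual = [300_000, 400_000, 500_000, 600_000]
model_preds = [310_000, 390_000, 520_000, 580_000]
average_guesser = [450_000] * 4  # always predicts the mean price

print(round(r_squared(actual, model_preds), 2))      # 0.98
print(round(r_squared(actual, average_guesser), 2))  # 0.0
```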
Not every student who passes a test has truly learned the material. The student might have memorized flash cards but not internalized concepts. The student might have cheated, somehow seeing the test in advance. The student might simply have gotten lucky. It is the same with machine learning models.
One rudimentary mistake in evaluating machine learning models would be to test the model on the same data used for model training. The model might perform very well, but simply because it has essentially memorized the data. It has failed to generalize any learning about the underlying phenomenon it is meant to detect, and it is likely to fail when it encounters new data in the real world. The technical term for this memorization-like behavior is overfitting.
The usual safeguard is called a train-test split: One divides the available data into a set that the model is allowed to learn from (training data) and another portion it is not allowed to see until the exam (the test set). But this safeguard, too, can give imperfect results, because an unlucky split can skew the model’s test results. Furthermore, if data is limited, there is a painful tradeoff between using data for training versus preserving it for testing.
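A bare-bones version of such a split might look like the following. The function name, the 20% test fraction, and the fixed seed are all assumptions for illustration; changing the seed changes which rows land in the test set, which is exactly the "unlucky split" risk described above.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows, then hold out the last chunk as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[:-n_test], shuffled[-n_test:]

rows = list(range(10))  # stand-in for ten data points
train, test = train_test_split(rows)

print(len(train), len(test))  # 8 2
```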
ML practitioners address these problems with cross-validation. With cross-validation, a dataset is divided into so-called folds. Most folds are used to train the model, while one is reserved to test it. Then, the process is repeated on a fresh copy of the model, with the folds rotated; a different fold is now the test set. The test scores from these various runs are averaged. This approach gives a more stable estimate of how well the model is likely to perform on new data, while also getting more mileage out of a limited dataset (because each datapoint can be used for training in one context, testing in another).
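The rotation of folds can be sketched in plain Python. Everything here is illustrative: the "model" simply predicts the mean of its training data, the score is mean absolute error, and the six prices are made up. Real workflows usually shuffle the data before splitting, which this sketch omits.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

def cross_validate(values, k):
    """Average test error of a toy model that always predicts the training mean."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(values), k):
        # "Train" a fresh model on the training folds only.
        prediction = sum(values[i] for i in train_idx) / len(train_idx)
        # Score it on the held-out fold with mean absolute error.
        mae = sum(abs(values[i] - prediction) for i in test_idx) / len(test_idx)
        scores.append(mae)
    return sum(scores) / len(scores)  # average across all folds

prices = [300, 320, 310, 500, 290, 305]  # hypothetical, in thousands of USD
print(cross_validate(prices, k=3))
```

Every data point serves as training data in some rounds and as test data in exactly one, which is how cross-validation gets more mileage out of a small dataset.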
Ultimately, if none of the candidate models performs sufficiently well, practitioners might try hyperparameter tuning—adjusting built-in settings such as model depth or learning rate—to see whether performance improves.
In Python, libraries like scikit-learn make cross-validation simple to implement, which is one reason it has become standard practice.
Sometimes the so-called “ground truth” is clear-cut: The patient does or doesn’t have cancer; the house sold for this or that amount. But with the advent of large language models (LLMs), model performance is often less clear-cut and harder to measure.
An LLM-fueled chatbot might face some binary tasks, like whether it gets facts right or wrong. But its user may also evaluate it along many different, difficult-to-define dimensions, like friendliness or helpfulness. In such cases, there is no single correct answer, no “true values” to benchmark against. Human annotation is considered the gold standard for evaluating LLM outputs, but it’s a method that doesn’t scale.
Ultimately, in such cases, the final model evaluation might come from launching a model into the wild and seeing whether users derive value from it or not.
