Date: 2026-03-10

Some models address “classification problems,” meaning they carve up the world into categories. Classification metrics are similarly blunt. Model accuracy is fairly intuitive: It takes the number of correct predictions and divides that by the total number of predictions. (In machine learning, the word “prediction” refers to the educated guesses models make, even if the guess is about something happening now rather than in the future.)

The problem with model accuracy is that a high number can lull stakeholders into a false sense of security. A model meant to detect a rare but catastrophic event (say, a certain kind of cancer) might reflexively classify every scan as negative. It would receive high model accuracy, because 99.99% of those negative readings would be correct. But this high accuracy would be cold comfort to the poor patient who received the rare false negative. The model was accurate in a technical sense, but didn’t do its job.
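To make the arithmetic concrete, here is a minimal Python sketch of that always-negative model. The data set is an assumption for illustration only: 10,000 scans, exactly one of which is a true cancer case, which is the ratio implied by the 99.99% figure. The `accuracy` helper is hypothetical, not taken from any particular library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return correct / len(y_true)


# Assumed data: 10,000 scans, exactly one of which shows cancer (1 = positive).
y_true = [1] + [0] * 9_999
# A "lazy" model that labels every scan negative (0).
y_pred = [0] * 10_000

acc = accuracy(y_true, y_pred)
# Count the true cancer cases the model failed to flag.
missed_cases = sum(
    1 for truth, guess in zip(y_true, y_pred) if truth == 1 and guess == 0
)

print(f"accuracy: {acc:.2%}")
print(f"cancers missed: {missed_cases}")
```

Under these assumptions the model scores 99.99% accuracy while missing the only case that mattered.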

It becomes useful to carve up a classification model’s performance into the types of predictions, or educated guesses, that it makes. In a binary classification task—like cancer detection—there are four possible outcomes (when arrayed in a 2x2 grid, this framework is often called a “confusion matrix”):

True positives (cancer detected accurately)
True negatives (cancer ruled out accurately)
False positives (cancer detected, but this was inaccurate)
False negatives (cancer not detected, and this was inaccurate)

Already one can begin to see why it is worthwhile to break out these categories. A false positive cancer diagnosis would no doubt be traumatic, until further testing revealed the episode to be a medical scare. But a false negative reading can be fatal.
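The four outcomes can be tallied with a few lines of Python. This is only a sketch: the `confusion_counts` helper and the eight example labels are invented for illustration, with 1 standing for "cancer" and 0 for "no cancer."

```python
def confusion_counts(y_true, y_pred):
    """Tally the four cells of a binary confusion matrix."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for truth, guess in zip(y_true, y_pred):
        if guess == 1 and truth == 1:
            counts["TP"] += 1  # cancer detected accurately
        elif guess == 0 and truth == 0:
            counts["TN"] += 1  # cancer ruled out accurately
        elif guess == 1 and truth == 0:
            counts["FP"] += 1  # cancer detected, but this was inaccurate
        else:
            counts["FN"] += 1  # cancer not detected, and this was inaccurate
    return counts


# Hypothetical labels for eight scans (1 = cancer, 0 = no cancer).
y_true = [1, 1, 0, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 0, 0, 1]

cells = confusion_counts(y_true, y_pred)
print(cells)
```

Keeping the four counts separate, rather than folding them into one accuracy number, is what lets the costlier error type (here, false negatives) be tracked on its own.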

Data science practitioners have developed an array of submetrics to probe the performance of classifiers and assess the relationships among the quadrants of the confusion matrix.

The metric called precision asks, of all the positive predictions a classifier made, how many were correct?

A car-mounted image-recognizing algorithm passes 10 intersections on a test course, six of which have stop signs. Yet to say a model “caught all six stop signs” would be to elide key potential differences in precision. If it flagged all six accurately and did not produce false positives, then it had a precision of 6/6, or 100%. However, if it flagged those six but also hallucinated four stop signs that were not there, its precision was only 6/10, or a mere 60%.
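The stop-sign arithmetic above can be checked with a short Python sketch. The `precision` helper is a hypothetical name, and raising an error when nothing was flagged is one possible convention, since precision is undefined in that case.

```python
def precision(true_positives, false_positives):
    """Of everything the model flagged, what fraction was really there?"""
    flagged = true_positives + false_positives
    if flagged == 0:
        # Precision is undefined when the model flags nothing at all.
        raise ValueError("no positive predictions were made")
    return true_positives / flagged


# Model A: flags the six real stop signs and nothing else.
clean_run = precision(true_positives=6, false_positives=0)
# Model B: flags the six real stop signs plus four that were not there.
noisy_run = precision(true_positives=6, false_positives=4)

print(f"clean run: {clean_run:.0%}")
print(f"noisy run: {noisy_run:.0%}")
```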

The metric called recall (also known as the “true positive rate”) measures something subtly different. Recall asks, of all the stop signs that were indeed there, how many did the model catch?

Imagine another test course with 100 intersections, 50 of which have stop signs. A model that catches 30 of these stop signs would have a recall of 60%; 40 of these, 80%; and so on. (Recall doesn’t concern itself with false alarms, so in theory one can “game” 100% recall by teaching a model to see stop signs everywhere.)
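A similar sketch reproduces the recall figures for the 100-intersection course. The `recall` helper is again a hypothetical name. The last case illustrates the "gaming" strategy: a model that flags every intersection catches all 50 stop signs but also raises 50 false alarms.

```python
def recall(true_positives, false_negatives):
    """Of everything that was really there, what fraction did the model catch?"""
    actual_positives = true_positives + false_negatives
    if actual_positives == 0:
        # Recall is undefined when there is nothing to find.
        raise ValueError("no actual positives in the data")
    return true_positives / actual_positives


TOTAL_STOP_SIGNS = 50  # from the 100-intersection test course

recall_30 = recall(30, TOTAL_STOP_SIGNS - 30)  # catches 30, misses 20
recall_40 = recall(40, TOTAL_STOP_SIGNS - 40)  # catches 40, misses 10

# "Gaming" recall: flag all 100 intersections as having a stop sign.
gamed_recall = recall(50, 0)
gamed_precision = 50 / (50 + 50)  # 50 true positives, 50 false positives

print(f"{recall_30:.0%}, {recall_40:.0%}")
print(f"gamed recall {gamed_recall:.0%}, gamed precision {gamed_precision:.0%}")
```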

These two metrics, precision and recall, exist in tension. An engineer seeking to improve recall might overshoot the mark, creating a model that too often gives false positives. Often, tuning a model amounts to managing the tradeoff between recall (catching every instance of the phenomenon you seek to detect) and precision (avoiding false positives along the way).
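One common way this tension shows up is through a decision threshold. The sketch below assumes a model that outputs a confidence score for each intersection and flags a stop sign whenever the score meets the threshold. The scores, labels, thresholds, and the `precision_recall_at` helper are all invented for illustration.

```python
# Hypothetical (confidence score, is_stop_sign) pairs; 1 means a real stop sign.
scored_examples = [
    (0.95, 1), (0.90, 1), (0.80, 1), (0.70, 0), (0.60, 1),
    (0.50, 0), (0.40, 1), (0.30, 0), (0.20, 0), (0.10, 0),
]


def precision_recall_at(threshold, examples):
    """Flag every example scoring at or above the threshold, then score the result."""
    tp = sum(1 for score, label in examples if score >= threshold and label == 1)
    fp = sum(1 for score, label in examples if score >= threshold and label == 0)
    fn = sum(1 for score, label in examples if score < threshold and label == 1)
    # Convention assumed here: report 0.0 when a ratio would be undefined.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Lowering the threshold makes the model more willing to flag a stop sign.
results = {t: precision_recall_at(t, scored_examples) for t in (0.75, 0.55, 0.35)}
for threshold, (p, r) in results.items():
    print(f"threshold {threshold:.2f}: precision {p:.2f}, recall {r:.2f}")
```

On this toy data, each step down in the threshold catches more real stop signs (recall rises from 0.60 to 1.00) while letting in more false alarms (precision falls from 1.00 to about 0.71).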

In managing this tradeoff, machine learning practitioners often use a metric called an F1 score, which is a “harmonic mean” of precision and recall. (A harmonic mean differs from the more traditional average in that it is disproportionately affected by low values. An F1 score thus drops quickly if either precision or recall is low.)
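The harmonic mean of two numbers is 2PR / (P + R), where P is precision and R is recall. The Python sketch below compares it with the ordinary average. The helper names and the example precision and recall values are assumptions chosen for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        # Convention assumed here: report 0.0 when both inputs are zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


def arithmetic_mean(a, b):
    """The ordinary average, for comparison."""
    return (a + b) / 2


# Balanced model: precision and recall are both 0.8.
balanced_f1 = f1_score(0.8, 0.8)

# Lopsided model: perfect precision, but it catches only 10% of positives.
lopsided_f1 = f1_score(1.0, 0.1)
lopsided_avg = arithmetic_mean(1.0, 0.1)

print(f"balanced F1: {balanced_f1:.2f}")
print(f"lopsided F1: {lopsided_f1:.2f} vs. ordinary average: {lopsided_avg:.2f}")
```

For the lopsided model, the ordinary average comes out at 0.55, while the F1 score is about 0.18, which shows how strongly the low recall drags the harmonic mean down.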

A perfect F1 score would be 1.0, but unfortunately there is no one-size-fits-all guidance for what counts as a sufficiently high F1 score; context matters greatly.[2] What’s clear is that a higher F1 score is better. The nearer the score is to 1.0, the more effectively the model detects what it is meant to detect while minimizing false positives and false negatives.[3]