What is a confusion matrix?

Authors

Jacob Murel Ph.D.

Senior Technical Content Creator

Business Development + Partnerships

IBM Research

The confusion matrix helps assess classification model performance in machine learning by comparing predicted values against actual values for a dataset.

A confusion matrix (or error matrix) is a visualization method for classifier algorithm results. More specifically, it is a table that breaks down the number of ground truth instances of a specific class against the number of predicted class instances. Confusion matrices are one of several tools for evaluating the performance of a classification model. They can be used to calculate a number of other model performance metrics, such as precision and recall.

Confusion matrices can be used with any classifier algorithm, such as Naïve Bayes, logistic regression models, decision trees, and so forth. Because of their wide applicability in data science and machine learning, many packages and libraries come preloaded with functions for creating confusion matrices, such as scikit-learn’s sklearn.metrics module for Python.

The confusion matrix layout

In a confusion matrix, columns represent the predicted values of a given class while rows represent the actual values (i.e. ground truth) of a given class. Note that the reverse orientation also appears in research, so it is worth checking which axis holds the predictions before reading a matrix. This grid structure is a convenient tool for visualizing model classification accuracy by displaying the number of correct predictions and incorrect predictions for all classes alongside one another.

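A standard confusion matrix template for a binary classifier may look like this:

                    Predicted positive      Predicted negative
Actual positive     True positive (TP)      False negative (FN)
Actual negative     False positive (FP)     True negative (TN)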

The top-left box provides the number of true positives (TP), being the number of correct predictions for the positive class. The box beneath it is false positives (FP), those negative-class instances incorrectly identified as positive cases. These are also called type I errors in statistics. The top-right box is the number of false negatives (FN), the actual positive instances erroneously predicted negative. Finally, the bottom-right box displays the number of true negatives (TN), which are the actual negative class instances accurately predicted negative. Totaling up each of these values would provide the model’s total number of predictions.¹
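The four cells described above can be tallied directly from paired labels. The following is a minimal sketch in plain Python, assuming labels are encoded as 1 (positive) and 0 (negative); the function name binary_confusion_counts and the sample data are hypothetical and not part of the original article.

```python
def binary_confusion_counts(actual, predicted, positive=1):
    """Tally TP, FP, FN and TN for a binary classifier (illustrative helper)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1          # predicted positive, actually positive
        elif p == positive:
            fp += 1          # predicted positive, actually negative (type I error)
        elif a == positive:
            fn += 1          # predicted negative, actually positive (type II error)
        else:
            tn += 1          # predicted negative, actually negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Hypothetical ground truth and predictions for ten instances.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = binary_confusion_counts(actual, predicted)
total = sum(counts.values())  # the four cells sum to the number of predictions
print(counts, total)
```

When using a library instead, check its cell ordering. For 0/1 labels, scikit-learn’s confusion_matrix function sorts the classes, so it returns rows of actual values and columns of predicted values in the order [[TN, FP], [FN, TP]], which differs from the template above.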

Of course, this template is for a rudimentary binary classification problem. The confusion matrix can visualize results for multiclass classification problems as well. For example, imagine that we are developing a species classification model as part of a marine life conservation program. The model predicts fish species. A confusion matrix for such a multiclass classification problem may look like this:

The diagonal boxes all indicate correct predictions, that is, true positives for the respective class. The other boxes provide quantities for false positives, false negatives, and true negatives, depending on which class one chooses to focus on.
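To make the class-by-class reading concrete, the sketch below builds a small multiclass matrix (rows are actual classes, columns are predicted classes) and then collapses it into TP, FP, FN and TN for one chosen class. The species names, the data and both function names are hypothetical, chosen only for illustration.

```python
def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def one_vs_rest_counts(matrix, k):
    """Collapse a multiclass matrix into TP, FP, FN, TN for the class at index k."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]                          # diagonal cell
    fn = sum(matrix[k]) - tp                   # rest of the actual-class row
    fp = sum(row[k] for row in matrix) - tp    # rest of the predicted-class column
    tn = total - tp - fn - fp                  # everything not involving class k
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Hypothetical species labels and predictions.
labels = ["cod", "salmon", "tuna"]
actual = ["cod", "cod", "cod", "salmon", "salmon", "tuna", "tuna", "tuna"]
predicted = ["cod", "cod", "salmon", "salmon", "tuna", "tuna", "tuna", "cod"]

matrix = confusion_matrix(actual, predicted, labels)
salmon = one_vs_rest_counts(matrix, labels.index("salmon"))

for label, row in zip(labels, matrix):
    print(f"{label:>7}", row)
print("salmon:", salmon)
```

Focusing on a different class changes which off-diagonal cells count as false positives and which as false negatives, while the matrix itself stays the same.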

Using the confusion matrix for model evaluation

Given its readily accessible visualization of classifier predictive results, the confusion matrix is useful for calculating other model evaluation metrics. Values can simply be pulled from the matrix and plugged into a number of equations for measuring model performance.

Accuracy

Accuracy is the proportion of all predictions that a model gets right. On its own, however, it is not a wholly informative evaluation metric for classifiers. For instance, imagine we run a classifier on a dataset of 100 instances. The model’s confusion matrix shows only one false negative and no false positives; the model correctly classifies every other data instance. Thus the model has an accuracy of 99%. Though ostensibly desirable, high accuracy is not in itself indicative of excellent model performance. Say our model aims to classify highly contagious diseases: that 1% misclassification poses an enormous risk. Thus, other evaluation metrics can be used to provide a better picture of classification algorithm performance.
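The arithmetic behind this example can be sketched as follows. The article fixes only the total (100 instances), the single false negative and the absence of false positives; the split of the 99 correct predictions into 9 true positives and 90 true negatives is an assumption made here purely for illustration.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


# One false negative, no false positives (from the example above).
# The 9 / 90 split between TP and TN is an assumed, hypothetical breakdown.
tp, fp, fn, tn = 9, 0, 1, 90

acc = accuracy(tp, fp, fn, tn)
detected = tp / (tp + fn)  # share of actual positive cases the model found

print(f"accuracy: {acc:.2f}")                  # accuracy: 0.99
print(f"positives detected: {detected:.2f}")   # positives detected: 0.90
```

Under this assumed split, the 99% accuracy coexists with one in ten actual positive cases going undetected, which is the kind of gap that accuracy alone does not reveal.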

Precision and recall

Precision is the proportion of positive class predictions that actually belong to the class in question.² Another way of understanding precision is that it measures the likelihood that a randomly chosen instance predicted to belong to a certain class actually belongs to it.³ Precision may also be called positive predicted value (PPV). It is represented by the equation:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall denotes the percentage of class instances detected by a model.⁴ In other words, it indicates the proportion of correct positive predictions for a given class out of all actual instances of that class.⁵ Recall is also known as sensitivity or true positive rate (TPR) and is represented by the equation:

$$\text{Recall} = \frac{TP}{TP + FN}$$
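As a rough illustration, both metrics can be computed from the confusion matrix counts. The sketch below uses hypothetical counts and hypothetical function names; returning None for an empty denominator is a design choice made here, not a convention stated in the article.

```python
def precision(tp, fp):
    """TP / (TP + FP); None when the model made no positive predictions."""
    predicted_positive = tp + fp
    return tp / predicted_positive if predicted_positive else None


def recall(tp, fn):
    """TP / (TP + FN); None when the data contains no actual positives."""
    actual_positive = tp + fn
    return tp / actual_positive if actual_positive else None


# Hypothetical counts read off a binary confusion matrix.
tp, fp, fn, tn = 30, 10, 20, 40

p = precision(tp, fp)  # 30 of the 40 positive predictions are correct
r = recall(tp, fn)     # 30 of the 50 actual positives are detected

print(f"precision: {p:.2f}")  # precision: 0.75
print(f"recall: {r:.2f}")     # recall: 0.60
```

Note that true negatives appear in neither calculation, which is why the two metrics are often preferred when the negative class dominates a dataset.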

F1 score

Precision and recall can share an inverse relationship at times. As a model increases recall by returning more actual class instances (i.e. true positives), it will often also misclassify more non-instances as members of the class (i.e. false positives), thereby decreasing precision.⁶ The F1 score attempts to combine precision and recall to account for this tradeoff.

The F1 score (also called F-score, F-measure, or the harmonic mean of precision and recall) combines precision and recall to represent a model’s total class-wise accuracy. Using these two values, one can calculate the F1 score with the following equation, where P denotes precision (PPV) and R denotes recall (sensitivity):

$$F_1 = \frac{2 \cdot P \cdot R}{P + R}$$

The F1 score is particularly useful for imbalanced datasets, in which the precision-recall tradeoff can be most apparent. For example, say we have a classifier predicting the likelihood of a rare disease. A model that predicts no one in our test dataset has the disease may have perfect precision yet zero recall. Meanwhile, a model that predicts everyone in our dataset has the disease would return perfect recall but precision equal to the percentage of people who actually have the disease (e.g. 0.00001% if only one in every ten million have the disease). The F1 score is a means of balancing these two values to obtain a more holistic view of a classifier’s performance.⁷
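The second scenario can be worked through numerically. The sketch below assumes a hypothetical population of ten million with exactly one actual case and a model that labels everyone as positive; the function name f1_score and the comparison values are illustrative.

```python
def f1_score(p, r):
    """Harmonic mean of precision p and recall r; 0.0 when both are zero."""
    if p + r == 0:
        return 0.0
    return 2 * p * r / (p + r)


# Hypothetical population: one actual case in ten million people.
population = 10_000_000

# A model that predicts "positive" for everyone.
tp, fp, fn = 1, population - 1, 0
p = tp / (tp + fp)  # equals the prevalence of the disease
r = tp / (tp + fn)  # every actual case is caught

f1_everyone = f1_score(p, r)
f1_balanced = f1_score(0.75, 0.6)  # a hypothetical model with moderate P and R

print(f"precision={p:.0e}, recall={r:.1f}, F1={f1_everyone:.1e}")
print(f"balanced model F1={f1_balanced:.3f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, the perfect recall of the predict-everyone model does almost nothing to offset its near-zero precision.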

Some researchers criticize the use of the F1 score as a performance metric. Such arguments typically claim that the F1 score gives equal weight to precision and recall, which may not be equally important performance metrics for all datasets.⁸ In response, researchers have proffered modified variants of the F1 score.⁹

Conditional measures

Conditional measures signify a model’s accuracy rate for detecting a certain class or non-class. Recall, also known as true positive rate (TPR) or sensitivity, is one such measure, indicating the ratio of correct positive class predictions out of all actual class instances. Specificity, or true negative rate (TNR), is the other conditional measure. It measures the proportion of correct negative predictions out of actual non-instances of a given class. One can compute specificity with the equation:¹⁰

$$\text{Specificity} = \frac{TN}{TN + FP}$$

False positive rate

Specificity helps calculate a model’s false positive rate (FPR), which is its complement:

$$\text{FPR} = \frac{FP}{FP + TN} = 1 - \text{Specificity}$$

Other classifier evaluation tools, notably the ROC curve and the AUC derived from it, utilize FPR. FPR is the probability that a model will falsely classify a non-instance of a certain class as part of that class. Thus, per its name, it represents the rate at which a model returns false positives, known as type I errors in statistics.

While type I errors refer to false positives, type II errors denote false negatives, actual instances of a given class erroneously classified as not part of that class. Per its name, the false negative rate (FNR) denotes the probability a model erroneously classifies an actual class instance as not part of that class. Much as FPR corresponds to specificity, FNR corresponds to sensitivity:

$$\text{FNR} = \frac{FN}{FN + TP} = 1 - \text{Sensitivity}$$

Note that FNR is often not used in literature because it requires knowing the total number of actual instances for a given class, which can remain unknown in unseen test datasets.¹¹
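The relationships among the four rates can be checked with a short sketch. The counts and the function name conditional_rates are hypothetical, and the code assumes both classes occur in the data so that no denominator is zero.

```python
def conditional_rates(tp, fp, fn, tn):
    """Rates conditioned on the actual class (assumes both classes are present)."""
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate (recall)
        "specificity": tn / (tn + fp),  # true negative rate
        "FPR": fp / (fp + tn),          # type I error rate
        "FNR": fn / (fn + tp),          # type II error rate
    }


# Hypothetical counts read off a binary confusion matrix.
rates = conditional_rates(tp=30, fp=10, fn=20, tn=40)

for name, value in rates.items():
    print(f"{name}: {value:.2f}")

# Each error rate is the complement of the matching success rate.
fpr_gap = abs(rates["FPR"] - (1 - rates["specificity"]))
fnr_gap = abs(rates["FNR"] - (1 - rates["sensitivity"]))
```

Each pair shares a denominator (actual negatives for specificity and FPR, actual positives for sensitivity and FNR), which is why the two values in a pair always sum to one.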

Unconditional metrics

Unconditional metrics are those that represent the chances of a specific class occurring or not occurring according to the model. Precision, or positive predicted value (PPV), is one unconditional metric. As mentioned, it measures the likelihood that an instance predicted to belong to a certain class actually belongs to it. The other unconditional metric, negative predicted value (NPV), is the probability that an instance predicted not to belong to that class actually does not belong to it. Essentially, both unconditional metrics attempt to answer whether a randomly chosen instance will belong to a specific class or not, given the model’s prediction. One can compute NPV with the equation:¹²

$$\text{NPV} = \frac{TN}{TN + FN}$$
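Both predicted values can be read off the confusion matrix by conditioning on the model's prediction instead of the actual class. The sketch below uses hypothetical counts and a hypothetical function name, and assumes the model made at least one positive and one negative prediction.

```python
def predictive_values(tp, fp, fn, tn):
    """PPV and NPV (assumes at least one positive and one negative prediction)."""
    ppv = tp / (tp + fp)  # correct share of positive predictions (precision)
    npv = tn / (tn + fn)  # correct share of negative predictions
    return ppv, npv


# Hypothetical counts read off a binary confusion matrix.
ppv, npv = predictive_values(tp=30, fp=10, fn=20, tn=40)

print(f"PPV: {ppv:.2f}")
print(f"NPV: {npv:.2f}")
```

In terms of the binary template, the conditional measures are computed across a row of actual values, while PPV and NPV are computed down a column of predicted values.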

Footnotes

¹ Kai Ming Ting. "Confusion matrix." Encyclopedia of Machine Learning and Data Mining. Springer. 2018.

² Ethan Zhang and Yi Zhang. "Precision." Encyclopedia of Database Systems. Springer. 2018.

³ Max Kuhn and Kjell Johnson. Applied Predictive Modeling. Springer. 2016.

⁴ Ethan Zhang and Yi Zhang. "Recall." Encyclopedia of Database Systems. Springer. 2018.

⁵ Max Kuhn and Kjell Johnson. Applied Predictive Modeling. Springer. 2016.

⁶ Ben Carterette. "Precision and Recall." Encyclopedia of Database Systems. Springer. 2018.

⁷ Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press. 2016. https://www.deeplearningbook.org/. Kevin Murphy. Machine Learning: A Probabilistic Perspective. MIT Press. 2012.

⁸ David Hand and Peter Christen. "A note on using the F-measure for evaluating record linkage algorithms." Statistics and Computing. Vol. 28. 2018. pp. 539–547. https://link.springer.com/article/10.1007/s11222-017-9746-6.

⁹ David Hand, Peter Christen, and Nishadi Kirielle. "F*: an interpretable transformation of the F-measure." Machine Learning. Vol. 110. 2021. pp. 451–456. https://link.springer.com/article/10.1007/s10994-021-05964-1. Davide Chicco and Giuseppe Jurman. "The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation." BMC Genomics. Vol. 21. 2020. https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-019-6413-7.

¹⁰ Max Kuhn and Kjell Johnson. Applied Predictive Modeling. Springer. 2016.

¹¹ Allen Downey. Think Stats. 2nd edition. O’Reilly. 2014.

¹² Max Kuhn and Kjell Johnson. Applied Predictive Modeling. Springer. 2016.