Classification versus regression

16 June 2025

Read time 30 minutes

Authors

Fangfang Lee

Developer Advocate

IBM

Introduction

Classification and regression are the two core tasks of supervised machine learning (ML). The algorithms that solve them form the fundamentals of the artificial intelligence (AI) we know today.

Classification and regression algorithms are also at the core of data science and predictive models. They rely on labeled data to learn the relationships between input variables (features) and output variables (targets). But their goals differ: regression models predict continuous values (like house prices or patient blood pressure), while classification models predict discrete categories (such as whether an email is spam or not, or whether a tumor is malignant or benign).

The algorithms described in this explainer can be implemented by using a Python library such as scikit-learn, or with TensorFlow or PyTorch for more sophisticated neural network architectures. They benefit not just beginner data scientists but also seasoned AI practitioners looking to improve complex AI systems.
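To ground the distinction before diving deeper, here is a minimal sketch of the shared fit-and-predict workflow, assuming scikit-learn is installed; the built-in iris dataset and a synthetic regression target are used purely for illustration.

```python
# Minimal sketch (assumes scikit-learn): the same fit/predict pattern covers both task types.
from sklearn.datasets import load_iris, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

# Classification: predict a discrete class label (iris species).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Predicted classes:", clf.predict(X_test[:5]))

# Regression: predict a continuous value (synthetic target here).
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
reg = LinearRegression().fit(X_train, y_train)
print("Predicted values:", reg.predict(X_test[:5]))
```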

In an age dominated by generative AI—where models can create whimsical texts, generate images and even write code—it’s easy to forget the quiet workhorses of applied machine learning: regression and classification.

These two foundational paradigms remain central to the practical application of AI, especially in domains where structure, interpretability and precision still reign supreme. Despite the buzz around foundation models and unstructured data, much of the world's critical data remains tabular, structured and labeled—living in spreadsheets, databases or flat files. It’s in this space that regression and classification shine, especially in industries like:

  • Healthcare: Predicting disease risk (regression) or diagnosing based on symptoms (classification).
  • Finance: Estimating loan default probabilities (regression) or classifying transactions as fraudulent or legitimate (classification).
  • Public policy and social science: Modeling income distribution (regression) or classifying populations based on survey responses (classification).

Beyond predictive performance, traditional supervised learning methods offer two significant advantages:

  • Explainability: You can understand why a model makes a prediction—especially vital in regulated fields.
  • Efficiency: Classical algorithms often require fewer resources and data, making them ideal for small-scale to medium-scale deployments or embedded decision systems.

In fact, these methods are not mutually exclusive with generative AI. They often complement it: you might use classification to select which generative model to trigger or regression to score outputs based on alignment or coherence.

In the sections that follow, we move beyond surface-level intuition to examine the theoretical foundations of both regression and classification. From their roles as function approximation problems to their implementation in real-world systems, this article offers a deeper view into how these fundamental tasks are formally defined, rigorously approached and widely applied across domains.


Intuitive divide: Classification versus regression

At a glance, classification and regression differ in a way that feels almost obvious: classification predicts a discrete value, or discrete output, while regression (including linear regression and polynomial regression) predicts continuous numerical values, or continuous outputs. But under the surface, their distinction hinges on something more fundamental: how the problem is formulated and which loss function is used to fit the approximating function.

Both classification and regression fall under the umbrella of supervised learning, meaning they learn from labeled data—datasets where each input is paired with an expected output.

Let’s unpack the core concepts that define this divide:

  • In regression, the output is quantitative: a real-valued number such as revenue, temperature or forecasted values of a ticker.
  • In classification, the output is qualitative: classes can be binary (yes or no), nominal (unordered, like dog, cat or fish) or ordinal (ordered, like low, medium or high risk). Classification problems can also be multiclass, where each output belongs to exactly one of several classes, or multilabel, where an output can belong to several classes at once.

This distinction drives not just model choice but also how errors are defined, which algorithms are suitable and how results are interpreted. In this section, we walk through some intuitive examples. In the next few sections, we dive into the mathematical underpinnings of function approximation for classification and regression.

An intuition-driven example

Imagine you work at a hospital predicting outcomes for incoming patients:

  • If you want to predict the number of days a patient stays, you’re solving a regression problem.
  • If you want to predict whether the patient needs ICU admission, that’s a classification problem.
  • If you're predicting a patient’s blood type—A, B, AB or O—you're dealing with nominal classification, where categories have no inherent order.
  • If you're forecasting the disease stage: early, moderate or severe, you're in the realm of ordinal classification—a hybrid that carries both class boundaries and order.

Both tasks assume that we have a ground truth. Without labeled examples (that is, known historical outcomes), neither regression nor classification is feasible.

In the generative AI era, this is an important distinction. While large language models can generate outputs from unstructured prompts, they aren’t ideal in all cases for structured decision-making where precise, reliable targets matter. In these cases, labeled datasets and interpretable models remain essential.

Classification and regression through function approximation

As AI practitioners, we must learn not only the intuition behind regression and classification, but also the rich yet elegant functions that approximate them. A fundamental goal lies at the heart of supervised machine learning: to approximate a function $f$ that maps input data $X$ to outputs $Y$.

Whether we're classifying emails or predicting house prices, we assume that some (potentially unknown) process generates the outcomes we observe. Statistical learning theory frames this as:

 $Y = f(X) + \varepsilon$

Where:

  • $f(X)$ is the true function (or ground truth distribution),
  • $\varepsilon$ is the irreducible error (random noise that cannot be eliminated no matter what),
  • $\hat{f}(X)$ is our model's approximation.
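To make this decomposition concrete, here is a minimal simulation sketch; the linear form of the "true" function and the noise level are arbitrary choices for illustration.

```python
# Minimal simulation of Y = f(X) + epsilon with a known, made-up "true" function.
import numpy as np

rng = np.random.default_rng(seed=0)

def f_true(x):
    # Hypothetical "true" function; any form would do for the illustration.
    return 3.0 * x + 2.0

X = rng.uniform(0, 10, size=200)        # input data
epsilon = rng.normal(0, 1.5, size=200)  # irreducible noise
Y = f_true(X) + epsilon                 # observed outputs

# A model only ever sees (X, Y); its job is to recover an approximation
# f_hat of f_true, and even a perfect f_hat leaves epsilon unexplained.
```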

In the next subsection, we will explore how this plays out in regression and classification.

Regression: Function approximation through least squares

There are several different types of regressions, including:

  • Simple linear regression: the simplest and most intuitive regression, where the output is modeled as a linear combination of the input features.
  • Polynomial regression: a regression technique where the output is modeled as a nonlinear, polynomial function of the input, allowing curves instead of straight lines.
  • Multiple regression: a regression method where the output depends on multiple input variables, allowing the model to capture the combined effect of several predictors on the outcome.
  • Ridge regression (L2 regression): a regularized version of linear regression that discourages large coefficients by penalizing their squared values, helping to reduce overfitting when features are correlated.
  • Lasso regression (L1 regression): a technique that not only penalizes large coefficients but can drive some to exactly zero, making it useful for selecting only the most important features.
  • Elastic net regression: a hybrid method that blends L1 and L2 regularization to balance feature selection and coefficient shrinkage, especially helpful when features are highly correlated or when there are more predictors than observations.
  • Logistic regression: a classification algorithm that uses a logistic function to model the probability of a binary outcome based on input features.
  • Autoregressive (AR) models: time series models where the current value is predicted as a linear combination of its own previous values.

In this section, we use linear regression as an example of how function approximation aids in building a robust model that best represents the relationship between inputs and outputs.

Let:

 $\hat{f}(X) = \beta_0 + \sum_{j=1}^{p} \beta_j x_j$

The unknown true function between $X$ and $y$ is denoted as $f$, and we assume that it can be approximated as a weighted sum of the input features $x_j$, with weights $\beta_j$. Note that when we make assumptions about the form of the model, the approach is called a parametric method.

Linear regression is an example of a parametric model where we assume the underlying relationship is linear. The parameters, or coefficients $\beta$, are found by minimizing the residual sum of squares (RSS):

 $\text{RSS} = \sum_{i=1}^{n} \left( y_i - \hat{f}(x_i) \right)^2 = \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2$

This quantity measures the observed error between predicted and actual values; minimizing it yields an estimated function $\hat{f}$ that best represents the true function $f$.
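As a sketch of what minimizing the RSS looks like in practice, here is a closed-form least squares solve on synthetic data; the coefficients and noise level are made up for illustration.

```python
# Minimal least squares sketch: estimate beta by minimizing the RSS in closed form.
import numpy as np

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(100, 2))                    # two input features
beta_true = np.array([2.0, 1.5, -3.0])           # hypothetical [beta_0, beta_1, beta_2]
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.5, size=100)

# Prepend a column of ones so the intercept beta_0 is estimated too.
X_design = np.column_stack([np.ones(len(X)), X])
beta_hat, rss, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print("Estimated coefficients:", beta_hat)       # close to beta_true
print("Residual sum of squares:", rss)
```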

The RSS objective is quadratic in the parameters, leading to a convex loss surface, as explained in the following list. Let's break down this terminology:

  1. Quadratic: These are functions that are shaped like a smooth, curved bowl. No matter where you start on the surface, the curve always slopes down toward a single lowest point in the center, called the global minimum. When graphed, they form a U-shaped curve. Mathematically, a quadratic function includes squared terms (like $x^2$ or $\beta^2$), which create that curvature. This is why, in linear regression, the loss function, which measures how far off predictions are, is quadratic in the parameters: it produces a predictable, convex surface that we can optimize efficiently by using gradient descent.
  2. Convexity: A function is convex if the line segment between any two points on the graph lies above the graph itself. In other words, there's only one global minimum, and no local minima that might trap optimization algorithms. The RSS is convex in relation to the parameters $\beta$, which is why optimization is tractable. Visually, the loss surface (a plot of the RSS as a function of $\beta$) is shaped like a multidimensional bowl. This matters because convexity guarantees that the optimization problem has no local minima aside from the global one. This also makes methods like gradient descent reliable.
  3. Gradient descent: an iterative optimization algorithm used to minimize a loss function. At each step, we compute the gradient (that is, the vector of partial derivatives), which points in the direction of steepest ascent—and we move in the opposite direction to descend toward the minimum.
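To connect these three ideas, here is a minimal gradient descent sketch that minimizes the RSS on synthetic data; the data, learning rate and iteration count are illustrative choices, not a production recipe.

```python
# Minimal gradient descent sketch for linear regression on synthetic data.
import numpy as np

rng = np.random.default_rng(seed=0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])  # intercept + 2 features
y = X @ np.array([2.0, 1.5, -3.0]) + rng.normal(scale=0.5, size=100)

def rss_gradient(beta, X, y):
    # Gradient of RSS = sum((y - X @ beta)^2) with respect to beta.
    return -2.0 * X.T @ (y - X @ beta)

beta = np.zeros(X.shape[1])   # start anywhere on the convex bowl
eta = 0.001                   # learning rate (step size)
for _ in range(5000):
    beta = beta - eta * rss_gradient(beta, X, y)   # step opposite the gradient

print("Estimated coefficients:", beta)   # approaches the least squares solution
```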

Now let's apply our theoretical understanding to the previously mentioned example:

When predicting housing prices, our goal isn’t just to build a model that fits past data—it’s to build one that learns from its mistakes. Each incorrect prediction—say, overestimating the price of a small condo or undervaluing a house with a renovated kitchen—contributes to a cumulative error signal.

The loss function (in this case, RSS) measures how far off these predictions are and acts as a feedback mechanism. The optimization algorithm reads this signal and adjusts the model’s internal parameters step by step to reduce those errors. Over time, the model learns to correct its biases and make increasingly accurate predictions, not by memorizing data, but by generalizing the underlying relationships between features and outcomes.

This iterative process is what turns raw historical data into a predictive tool. The model doesn’t "know" real estate, but it internalizes the statistical patterns in the data. If it consistently makes the same kind of mistake—like undervaluing homes in a specific postal code—the loss function pushes it to recalibrate.

Eventually, it finds a balance where the overall error is minimized, enabling it to price unseen properties with greater accuracy. In this way, the loss function isn’t just a score—it’s the compass guiding the learning.

Despite the rise of complex architectures like transformers, regression remains foundational in the gen AI era. At their core, neural networks—including those powering large language models—begin with linear combinations of inputs, just like classical regression.

Each layer applies weighted sums followed by nonlinear transformations, and the entire model is trained by using gradient descent, the same optimization principle used to minimize loss in simple regression tasks. In this sense, regression isn't outdated—it’s embedded in the very mechanics of how modern AI learns, scales and adapts.


Classification: Function approximation through maximum likelihood

In regression, our goal is to estimate a number—for instance, the price of a house. But classification asks a different kind of question: what is the likelihood that this input belongs to category A versus category B? For example, will this email be spam or not? Will this patient develop a disease or not?

To answer these types of questions, we use probability-based models. And one of the most powerful frameworks for learning these probabilities is maximum likelihood estimation (MLE). MLE allows us to find the model parameters that make our observed data most likely under the model—a principle that is both statistically elegant and widely used in practice.

Let’s walk through how classification is framed as a function approximation problem with MLE, by using logistic regression as a simple but powerful example. You can visit the logistic regression page for details on how we derive the logistic function. 

Step 1: From outputs to probabilities

First, instead of predicting a number directly, we estimate the probability that the outcome belongs to class 1 (as opposed to class 0). To do this, we take a linear combination of the inputs, just as in linear regression, but then transform that value by using the logistic sigmoid function so the output stays between 0 and 1.

 $P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \sum_{j=1}^{p} \beta_j x_j)}} = \sigma(\beta_0 + x^\top \beta)$

The transformation is necessary because a linear function alone can produce any real number—from negative infinity to positive infinity—but a probability must fall within the [0, 1] range. The sigmoid "squashes" the linear result into that interval, allowing us to interpret it as a probability: a value near 0 means low likelihood of class 1, near 1 means high confidence.
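A minimal sketch of that squashing behavior, using a handful of arbitrary linear scores:

```python
# Minimal sketch: the sigmoid squashes any real-valued score into (0, 1).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

linear_scores = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])  # beta_0 + x . beta can be any real number
print(sigmoid(linear_scores))
# Approximately [0.007, 0.269, 0.5, 0.731, 0.993]: interpretable as P(y = 1 | x)
```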

The logistic regression explainer mentioned earlier elaborates on how this transformation is derived and applied.

Step 2: Likelihood as a function of parameters

Now we ask: given our data (input-output pairs), what choice of parameters best explains what we’ve seen? MLE formalizes this by defining a likelihood function—the probability of observing all our training labels, given the inputs and current model parameters.

 $L(\beta) = \prod_{i=1}^{n} P(y_i \mid x_i; \beta) = \prod_{i=1}^{n} \hat{p}_i^{\,y_i} (1 - \hat{p}_i)^{1 - y_i}$

If the actual label $y_i$ is 1, then we want the predicted probability $\hat{p}_i$ to be high. If $y_i$ is 0, we want $(1 - \hat{p}_i)$ to be high.

This product accumulates the evidence across all examples, and our job is to find the parameters that make this entire product (the joint likelihood) as large as possible. The parameters that maximize the likelihood function are deemed the best at modeling the observations.

Step 3: Taking the log for stability

Multiplying many small probabilities can lead to numerical instability (the values get vanishingly small and can underflow). So instead, we take the log of the likelihood function, turning the product into a sum and making the math easier to handle. The logarithm transforms a product of extreme values into a sum of more stable values that is easier to compute and optimize.

 $\log L(\beta) = \sum_{i=1}^{n} \left[ y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i) \right]$

This log-likelihood is now our new objective function—the thing that we want to maximize. Each term rewards the model when it assigns a high probability to the correct class.

Think of it like this: every time the model is "confident and correct," we add a positive score. But if it’s confident and wrong—say it assigns a high probability to class 1 when the true label is 0—the log-likelihood penalizes that sharply.
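Here is a minimal sketch of that scoring behavior, with toy labels and predicted probabilities chosen only for illustration:

```python
# Minimal sketch: the log-likelihood rewards confident, correct probabilities
# and sharply penalizes confident, wrong ones.
import numpy as np

y = np.array([1, 0, 1, 1, 0])                  # true labels
p_hat = np.array([0.9, 0.2, 0.8, 0.6, 0.1])    # model's predicted P(y = 1 | x)

log_likelihood = np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
print(log_likelihood)   # closer to 0 (less negative) means a better fit

# Compare with a confidently wrong prediction on the first example:
p_bad = p_hat.copy()
p_bad[0] = 0.01
print(np.sum(y * np.log(p_bad) + (1 - y) * np.log(1 - p_bad)))  # much more negative
```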

Step 4: Optimizing the log-likelihood

We don’t have a nice closed-form solution here (unlike linear regression), so we turn to gradient descent—an iterative optimization method. The idea is to take small steps in the direction that increases the log-likelihood the most, eventually reaching a parameter setting where the model best explains the data.

 $\beta^{(t+1)} = \beta^{(t)} + \eta \, \nabla_{\beta} \log L(\beta^{(t)})$

The equation shows how we move from the current parameter vector $\beta^{(t)}$ to a new one $\beta^{(t+1)}$ by adding a small step in the direction of the gradient.

The term $\log L(\beta)$ is the log-likelihood function: it quantifies how well the model explains the observed data. The gradient $\nabla_{\beta} \log L$ tells us the direction in which this function increases most steeply. The parameter $\eta$, called the learning rate, controls how large of a step we take; too large and we might overshoot the optimum, too small and training can be painfully slow.

This solution uses the same principles of gradient descent explained in the linear regression segment. In short, this equation is the engine of learning: it gradually adjusts the model to increase the likelihood of getting the right answer.
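The following minimal sketch implements this update rule on simulated binary data; the parameter values, learning rate and iteration count are arbitrary illustrative choices.

```python
# Minimal sketch: maximizing the log-likelihood of logistic regression by
# repeatedly stepping in the direction of its gradient.
import numpy as np

rng = np.random.default_rng(seed=0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])   # intercept + 2 features
beta_true = np.array([-0.5, 2.0, -1.0])                          # hypothetical parameters
p = 1.0 / (1.0 + np.exp(-(X @ beta_true)))
y = rng.binomial(1, p)                                           # simulated 0/1 labels

beta = np.zeros(3)
eta = 0.01                                    # learning rate
for _ in range(2000):
    p_hat = 1.0 / (1.0 + np.exp(-(X @ beta)))
    gradient = X.T @ (y - p_hat)              # gradient of the log-likelihood
    beta = beta + eta * gradient              # beta(t+1) = beta(t) + eta * gradient

print("Estimated parameters:", beta)          # should land near beta_true
```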

Step 5: From probabilities to decisions

Once optimized, we classify new inputs by thresholding the predicted probability. For binary classification: if the predicted probability p is greater than 0.5, then the predicted class label y is assigned as 1; otherwise, y is assigned as 0.

For multiclass classification (that is, softmax regression), we generalize this using the softmax function over class scores and choose the class with the highest probability. The threshold can be adjusted according to the discretion of the practitioners and the problems at hand.
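A minimal sketch of both decision rules, with made-up probabilities and class scores:

```python
# Minimal sketch: turning probabilities into decisions.
import numpy as np

# Binary case: threshold the predicted probability (0.5 is a common default).
p_hat = np.array([0.92, 0.30, 0.55])
labels = (p_hat > 0.5).astype(int)      # -> [1, 0, 1]

# Multiclass case: softmax over per-class scores, then take the argmax.
scores = np.array([2.0, 0.5, -1.0])     # hypothetical scores for 3 classes
softmax = np.exp(scores) / np.sum(np.exp(scores))
predicted_class = np.argmax(softmax)
print(labels, softmax.round(3), predicted_class)
```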

The maximum likelihood estimate (MLE) is the conceptual backbone of modern classification. It gives us a principled way to train models that don’t just guess, but assign measurable confidence to each prediction. That’s incredibly important in domains where mistakes are costly: healthcare, criminal justice, fraud detection and more.

Even the most advanced models in gen AI, like transformers, still follow this playbook. Their final layers typically output a softmax distribution, and training is done through log-likelihood maximization. So when you’re learning logistic regression and MLE, you’re not just learning a "simple" model—you’re studying the same ideas underpinning some of the most powerful systems in AI.

Common regression versus classification algorithms

Regression and classification tasks share a family of powerful supervised learning algorithms. While many algorithms can be adapted for both settings, the choice of model—and how it is configured—depends heavily on the type of output: continuous (regression) versus categorical (classification). This section introduces some of the most widely used algorithms, along with their intuitive mechanisms and mathematical underpinnings.

Decision trees

Decision trees are models that make predictions by asking a series of simple questions. Imagine playing a game of 20 questions—the tree starts at the top and works its way down by asking things like “Is the income greater than 50,000?” or “Is the age under 30?” Each question splits the data into smaller, more similar groups.

For classification, the goal is to group similar labels together, and the tree tries to reduce confusion (called "impurity") with each split. For regression, it tries to make each group’s average value closer to the actual numbers, reducing prediction error.

The tree keeps splitting until it reaches a stopping point—like hitting a maximum number of questions or having too few examples left to split. In the end, predictions are made based on the majority label (for classification) or the average value (for regression) in the final group, or “leaf.”

Classification trees aim to maximize information gain by reducing metrics like Gini impurity or entropy:

 $\text{Gini}(t) = 1 - \sum_{k=1}^{K} p_k^2$

where $p_k$ is the proportion of class $k$ in node $t$. Gini impurity and entropy are measures of how mixed or "impure" a node is in a decision tree. They help the tree decide where to split by quantifying how well a feature separates the classes: lower values mean purer nodes and better splits.

Regression trees aim to minimize the residual sum of squares (RSS) within each leaf:

 $\text{RSS} = \sum_{m=1}^{M} \sum_{i \in R_m} \left( y_i - \bar{y}_{R_m} \right)^2$

where $R_m$ is the m-th region and $\bar{y}_{R_m}$ is the mean of the observations in that region.
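As a minimal sketch of both variants, assuming scikit-learn (version 1.0 or later for the "squared_error" criterion name) and using a built-in dataset plus a synthetic one for illustration:

```python
# Minimal sketch: the same tree mechanism handles both tasks, switching between
# impurity-based splits (classification) and RSS-based splits (regression).
from sklearn.datasets import load_iris, make_regression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X_c, y_c = load_iris(return_X_y=True)
tree_clf = DecisionTreeClassifier(max_depth=3, criterion="gini").fit(X_c, y_c)

X_r, y_r = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=0)
tree_reg = DecisionTreeRegressor(max_depth=3, criterion="squared_error").fit(X_r, y_r)

print(tree_clf.predict(X_c[:3]), tree_reg.predict(X_r[:3]))
```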

Random forests

A random forest builds an ensemble of decision trees, each trained on a bootstrapped subset of the data and with random feature selection at each split. Final predictions are made by averaging (regression) or majority voting (classification) across the trees. Random forests reduce variance by averaging over many diverse models. Each tree is a "noisy expert," and aggregating them leads to a robust predictor. Random feature selection ensures diversity among the trees.
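A minimal sketch, assuming scikit-learn; the hyperparameter values are illustrative choices:

```python
# Minimal sketch: an ensemble of trees, each trained on a bootstrap sample
# with a random subset of features considered at every split.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)
print(forest.predict(X[:3]))   # majority vote across the 100 trees

# RandomForestRegressor works the same way but averages the trees' predictions.
```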

K-nearest neighbors (KNNs)

KNN is a nonparametric, instance-based method that stores the entire training set. This means that, unlike linear regression, KNN does not make any assumptions about the underlying distribution of the data and has no parameters to estimate.

To make a prediction for a new input, it finds the k closest training examples and either averages their values (regression) or selects the majority class (classification). KNN predicts based on local similarity: “you are like the people closest to you.” It makes no assumptions about the underlying function, which makes it flexible but computationally expensive.
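A minimal sketch of both modes on a tiny made-up dataset, assuming scikit-learn:

```python
# Minimal sketch: prediction comes from the k closest stored training points.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_value = np.array([1.1, 2.1, 2.9, 10.2, 11.1, 11.9])

print(KNeighborsClassifier(n_neighbors=3).fit(X, y_class).predict([[2.5]]))  # majority class
print(KNeighborsRegressor(n_neighbors=3).fit(X, y_value).predict([[2.5]]))   # neighbor average
```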

Support vector machines (SVMs)

SVMs find the hyperplane, a decision boundary that best separates classes (classification) or fits a margin around data (regression), with maximal margin or minimal deviation. SVMs focus on the "hardest" data points to classify: those near the decision boundary. They ignore easy examples and try to find the most robust separator. Kernels extend SVMs to nonlinear problems.
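A minimal sketch of the classifier and its regression counterpart, assuming scikit-learn; the kernel and hyperparameters are illustrative choices:

```python
# Minimal sketch: an RBF-kernel SVM for classification and SVR for regression,
# which fits predictions within an epsilon-wide margin around the data.
from sklearn.datasets import load_iris, make_regression
from sklearn.svm import SVC, SVR

X_c, y_c = load_iris(return_X_y=True)
svm_clf = SVC(kernel="rbf", C=1.0).fit(X_c, y_c)

X_r, y_r = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)
svm_reg = SVR(kernel="rbf", epsilon=0.1).fit(X_r, y_r)

print(svm_clf.predict(X_c[:3]), svm_reg.predict(X_r[:3]))
```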

Naive Bayes

Naive Bayes, in plain terms, is a "count-and-compare" approach. It is most often used for classification rather than regression because it is designed to assign class labels based on probabilities.

  • Start with what you already believe. You have a prior guess about how common each class is—say, how often emails are spam versus not spam.
  • Look at the new evidence. For every word (feature) in an email, you ask, “How often does this word appear in spam compared with nonspam?”
  • Combine the two with Bayes’ rule. Multiply the prior class frequency by the “word‑given‑class” frequencies and you get a score for each class. The class with the highest score wins.

The method is called naive because it assumes each word (or feature) is independent of the others. This oversimplification works shockingly well, especially for text classification where counting word occurrences is natural and fast.
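Here is a minimal sketch of that count-and-compare idea as a toy spam filter, assuming scikit-learn; the example messages and labels are made up for illustration:

```python
# Minimal sketch: word counts plus Bayes' rule for a toy spam filter.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = ["win a free prize now", "meeting agenda for monday",
            "free prize claim now", "project update and agenda"]
labels = [1, 0, 1, 0]          # 1 = spam, 0 = not spam

counts = CountVectorizer().fit(messages)
model = MultinomialNB().fit(counts.transform(messages), labels)
print(model.predict(counts.transform(["claim your free prize"])))   # -> [1] (spam)
```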

Evaluation metrics

Evaluating machine learning models requires different metrics depending on whether your task is regression or classification. Choosing the right metric can help you understand model performance, compare alternatives and align with real-world business or scientific goals.

Regression metrics

Regression metrics measure how close the predicted values are to the actual numerical outcomes.

  • Mean squared error (MSE): penalizes larger errors more heavily due to squaring.

 $\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

  • Root mean squared error (RMSE): puts error back on the same scale as the original data.

 $\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$

  • Mean absolute error (MAE): less sensitive to outliers than MSE.

 $\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$

  • R-squared ($R^2$): the proportion of variance in the target that is explained by the model.

 $R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$
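A minimal sketch computing these metrics on a handful of toy predictions, assuming scikit-learn:

```python
# Minimal sketch: regression metrics on toy predictions.
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.0, 9.5])

mse = mean_squared_error(y_true, y_pred)
print("MSE:", mse)
print("RMSE:", np.sqrt(mse))
print("MAE:", mean_absolute_error(y_true, y_pred))
print("R^2:", r2_score(y_true, y_pred))
```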

 

Classification metrics

Classification metrics evaluate how well the model separates classes.

  • Accuracy

Proportion of correct predictions.

 $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$

  • Precision

Of all predicted positives, how many are actually positive?

 $\text{Precision} = \frac{TP}{TP + FP}$

  • Recall (sensitivity)

Of all actual positives, how many were correctly predicted?

 $\text{Recall} = \frac{TP}{TP + FN}$

  • F1 score

Harmonic mean of precision and recall.

 $F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$

  • Confusion matrix

A tabular breakdown of true positives, true negatives, false positives and false negatives. Ideal for multiclass or imbalanced datasets.
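A minimal sketch computing these metrics from toy binary predictions, assuming scikit-learn:

```python
# Minimal sketch: classification metrics on toy binary predictions.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
```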

 

Conclusion and further reading

Classification and regression are two foundational pillars of supervised learning. In this article, we’ve explored how classification involves predicting discrete labels, while regression focuses on estimating continuous outcomes. We introduced some of the most widely used algorithms—from linear regression and logistic regression to decision trees, support vector machines and Bayesian methods. We also examined how they approximate functions and make predictions through statistical principles like least squares and maximum likelihood estimation.

While these models might appear elementary compared to today’s towering deep learning architectures, their principles quietly underpin many breakthroughs in modern AI.

The sigmoid function used in logistic regression is now a common activation function in neural networks. Gradient descent, first popularized in optimizing linear models, remains the backbone of how large neural networks—including transformers—are trained today. Linear combinations of input features, so central to simple models, form the first layers of most deep models. Understanding these “classical” approaches gives us a view into how complex systems like ChatGPT or BERT actually learn.

If you’re interested in diving deeper, there’s a vibrant body of research that continues to apply classification and regression in critical fields. In medicine, logistic regression is still heavily used to predict disease risk and treatment outcomes. In environmental science, regression models help quantify climate change effects. In bioinformatics, support vector machines classify genetic mutations. Far from being relics, these models remain indispensable tools of scientific progress.

Recent research continues to demonstrate the enduring relevance of these methods. Here are a few influential papers worth exploring:

  • Medical diagnostics: A 2023 study developed and validated logistic regression and machine learning models to diagnose diabetic retinopathy, highlighting the utility of traditional statistical methods in clinical settings.
  • Climate modeling: Researchers applied regression models to the ClimateBench dataset, showcasing their effectiveness in emulating complex climate systems and aiding in policy-making decisions.
  • Healthcare risk prediction: A 2023 study used logistic regression to predict the risk of acute aortic dissection rupture, demonstrating its applicability in emergency medicine.
  • Oncology research: Logistic regression was employed in 2023 to compare risk factors in disease modeling with imbalanced data, specifically examining vitamin D and cancer incidence.
  • COVID-19 prognostics: A 2022 study compared machine learning methods with logistic regression analysis to predict critical in-hospital events in COVID-19 patients, emphasizing the continued importance of traditional models in emerging health crises.

 

References

Yanya Lin, Jianxiong Hu, Rongbin Xu, Shaocong Wu, Fei Ma, Hui Liu, Ying Xie and Xin Li. 2023. “Application of Logistic Regression and Artificial Intelligence in the Risk Prediction of Acute Aortic Dissection Rupture.” Journal of Clinical Medicine 12 (1): 179. https://www.mdpi.com/2077-0383/12/1/179.

He-Yan Li, Li Dong, Wen-Da Zhou, Hao-Tian Wu, Rui-Heng Zhang, Yi-Tong Li, Chu-Yao Yua and Wen-Bin Wei. 2023. “Development and validation of medical record-based logistic regression and machine learning models to diagnose diabetic retinopathy.” Graefe's archive for ophthalmology. 261(3):681-689. https://pubmed.ncbi.nlm.nih.gov/36239780/.

Aaron W. Sievering, Peter Wohlmuth, Nele Geßler, Melanie A. Gunawardene, Klaus Herrlinger, Berthold Bein, Dirk Arnold, Martin Bergmann, Lorenz Nowak, Christian Gloeckner, Ina Koch, Martin Bachmann, Christoph U. Herborn and Axel Stang. 2022. “Comparison of machine learning methods with logistic regression analysis in creating predictive models for risk of critical in-hospital events in COVID-19 patients on hospital admission.” BMC Medical Informatics and Decision Making 22: Article 309. https://link.springer.com/article/10.1186/s12911-022-02057-4.

Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. 2013. An Introduction to Statistical Learning: With Applications in R. New York: Springer. https://www.statlearning.com.

Anmol Chaure, Ashok Kumar Behera and Sudip Bhattacharya. 2023. “Finding the Perfect Fit: Applying Regression Models to ClimateBench v1.0” arXiv preprint arXiv:2308.11854. https://arxiv.org/abs/2308.11854.

Mohammad Meysami, Vijay Kumar, McKayah Pugh, Samuel Thomas Lowery, Shantanu Sur, Sumona Mondal and James M. Greene. 2023. “Utilizing logistic regression to compare risk factors in disease modeling with imbalanced data: a case study in vitamin D and cancer incidence” Frontiers in Oncology 13: 1227842. https://www.frontiersin.org/articles/10.3389/fonc.2023.1227842/full.

Trevor Hastie, Robert Tibshirani and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference and Prediction. 2nd ed. New York: Springer. https://link.springer.com/book/10.1007/978-0-387-84858-7.