Classification and regression are both supervised machine learning (ML) approaches, and the algorithms behind them form the foundation of the artificial intelligence (AI) we know today.
Classification and regression algorithms are also at the core of data science and predictive models. They rely on labeled data to learn the relationships between input variables (features) and output variables (targets). But their goals differ: regression models predict continuous values (like house prices or patient blood pressure), while classification models predict discrete categories (such as whether an email is spam or not, or whether a tumor is malignant or benign).
The algorithms described in this explainer can be implemented with a Python library such as scikit-learn for classical models, or TensorFlow and PyTorch for more sophisticated neural network architectures. They can benefit not only beginner data scientists but also seasoned AI practitioners looking to improve complex AI systems.
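For instance, here is a minimal sketch (assuming scikit-learn is installed and using synthetic data) that trains one model of each kind:

```python
# A minimal sketch with scikit-learn on synthetic data: one regression
# model (continuous target) and one classification model (discrete target).
from sklearn.datasets import make_regression, make_classification
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

# Regression: predict a continuous value
X_reg, y_reg = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_reg, y_reg, random_state=0)
reg = LinearRegression().fit(Xr_train, yr_train)
print("Regression R^2:", reg.score(Xr_test, yr_test))

# Classification: predict a discrete class label
X_clf, y_clf = make_classification(n_samples=500, n_features=5, random_state=0)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(X_clf, y_clf, random_state=0)
clf = LogisticRegression().fit(Xc_train, yc_train)
print("Classification accuracy:", clf.score(Xc_test, yc_test))
```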
In an age dominated by generative AI—where models can create whimsical texts, generate images and even write code—it’s easy to forget the quiet workhorses of applied machine learning: regression and classification.
These two foundational paradigms remain central to the practical application of AI, especially in domains where structure, interpretability and precision still reign supreme. Despite the buzz around foundation models and unstructured data, much of the world's critical data remains tabular, structured and labeled—living in spreadsheets, databases or flat files. It's in this space that regression and classification shine, especially in data-intensive industries such as healthcare and finance.
Beyond predictive performance, traditional supervised learning methods offer significant practical advantages, notably interpretability and precision.
In fact, these methods are not mutually exclusive with generative AI. They often complement it: you might use classification to select which generative model to trigger or regression to score outputs based on alignment or coherence.
In the sections that follow, we move beyond surface-level intuition to examine the theoretical foundations of both regression and classification. From their roles as function approximation problems to their implementation in real-world systems, this article offers a deeper view into how these fundamental tasks are formally defined, rigorously approached and widely applied across domains.
At a glance, classification and regression differ in a way that feels almost obvious: classification predicts a discrete value, or discrete output, while regression (including linear regression and polynomial regression) predicts continuous numerical values, or continuous outputs. But under the surface, their distinction hinges on something more fundamental: how the problem is formulated and how the underlying function is approximated through a loss function.
Both classification and regression fall under the umbrella of supervised learning, meaning they learn from labeled data—datasets where each input is paired with an expected output.
Let's unpack the core concepts that define this divide.
This distinction drives not just model choice but also how errors are defined, which algorithms are suitable and how results are interpreted. In this section, we walk through some intuitive examples; in the next few sections, we dive into the mathematical underpinnings of function approximation for classification and regression.
Imagine you work at a hospital predicting outcomes for incoming patients: estimating a patient's blood pressure is a regression task (a continuous output), while predicting whether a tumor is malignant or benign is a classification task (a discrete output).
Both tasks assume that we have a ground truth. Without labeled examples (that is, known historical outcomes), neither regression nor classification is feasible.
In the generative AI era, this is an important distinction. While large language models can generate outputs from unstructured prompts, they aren't always ideal for structured decision-making where precise, reliable targets matter. In these cases, labeled datasets and interpretable models remain essential.
As AI practitioners, we must learn not only the intuition behind regression and classification, but also the rich yet elegant functions that approximate them. A fundamental goal lies at the heart of supervised machine learning: to approximate a function $f$ that maps input data $X$ to outputs $Y$.
Whether we're classifying emails or predicting house prices, we assume that some (potentially unknown) process generates the outcomes we observe. Statistical learning theory frames this as:

$$Y = f(X) + \epsilon$$

Where:

- $X$ is the vector of input features
- $Y$ is the output (target) we want to predict
- $f$ is the unknown true function relating inputs to outputs
- $\epsilon$ is an irreducible error term capturing noise that no model can eliminate
In the next subsection, we will explore how this plays out in regression and classification.
There are several different types of regression, including linear regression and polynomial regression.
In this section, we use linear regression as an example of how function approximation aids in building a robust model that best represents the relationship between inputs and outputs.
Let:

- $X = (X_1, X_2, \ldots, X_p)$ be the vector of input features
- $y$ be the continuous target variable we want to predict

The unknown true function between $X$ and $y$ is denoted as $f(X)$, and we assume that this function can be approximated as a weighted sum of the input features:

$$\hat{f}(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p$$

Note that when we make assumptions about the form of the model, the approach is called a parametric method.
Linear regression is an example of a parametric model where we assume the underlying relationship is linear. The parameters, or coefficients $\beta$, are found by minimizing the residual sum of squares (RSS):

$$\mathrm{RSS}(\beta) = \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2$$
Minimizing this quantity yields an estimated function $\hat{f}$ that best represents the true function: the RSS measures the observed error between the predicted values and the actual values, and our goal is to make it as small as possible.
This objective is quadratic in the parameters, leading to a convex loss surface. Let's break down this terminology: quadratic means the penalty grows with the square of each residual, and a convex loss surface has a single global minimum, so an optimizer (or the closed-form least squares solution) is guaranteed to find the best-fitting parameters.
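To make this concrete, here is a minimal sketch (assuming NumPy and synthetic, made-up data) that builds a design matrix, solves the least squares problem and reports the resulting RSS:

```python
# A sketch of RSS minimization on synthetic data. The least squares solution
# shown here is one way to minimize the convex RSS objective; the data and
# coefficients are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))                               # input features
true_beta = np.array([2.0, -1.0, 0.5])
y = 10.0 + X @ true_beta + rng.normal(scale=0.5, size=n)  # noisy targets

# Add an intercept column, then solve min_beta ||y - X beta||^2
X_design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

residuals = y - X_design @ beta_hat
rss = np.sum(residuals ** 2)
print("Estimated coefficients:", beta_hat)
print("Residual sum of squares:", rss)
```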
Now let's apply our theoretical understanding to the previously mentioned example:
When predicting housing prices, our goal isn’t just to build a model that fits past data—it’s to build one that learns from its mistakes. Each incorrect prediction—say, overestimating the price of a small condo or undervaluing a house with a renovated kitchen—contributes to a cumulative error signal.
The loss function (in this case, RSS) measures how far off these predictions are and acts as a feedback mechanism. The optimization algorithm reads this signal and adjusts the model’s internal parameters step by step to reduce those errors. Over time, the model learns to correct its biases and make increasingly accurate predictions, not by memorizing data, but by generalizing the underlying relationships between features and outcomes.
This iterative process is what turns raw historical data into a predictive tool. The model doesn’t "know" real estate, but it internalizes the statistical patterns in the data. If it consistently makes the same kind of mistake—like undervaluing homes in a specific postal code—the loss function pushes it to recalibrate.
Eventually, it finds a balance where the overall error is minimized, enabling it to price unseen properties with greater accuracy. In this way, the loss function isn’t just a score—it’s the compass guiding the learning.
Despite the rise of complex architectures like transformers, regression remains foundational in the gen AI era. At their core, neural networks—including those powering large language models—begin with linear combinations of inputs, just like classical regression.
Each layer applies weighted sums followed by nonlinear transformations, and the entire model is trained by using gradient descent, the same optimization principle used to minimize loss in simple regression tasks. In this sense, regression isn't outdated—it’s embedded in the very mechanics of how modern AI learns, scales and adapts.
In regression, our goal is to estimate a number—for instance, the price of a house. But classification asks a different kind of question: what is the likelihood that this input belongs to category A versus category B? For example, will this email be spam or not? Will this patient develop a disease or not?
To answer these types of questions, we use probability-based models. And one of the most powerful frameworks for learning these probabilities is maximum likelihood estimation (MLE). MLE allows us to find the model parameters that make our observed data most likely under the model—a principle that is both statistically elegant and widely used in practice.
Let’s walk through how classification is framed as a function approximation problem with MLE, by using logistic regression as a simple but powerful example. You can visit the logistic regression page for details on how we derive the logistic function.
Step 1: From outputs to probabilities
First, instead of predicting a number directly, we estimate the probability that the outcome belongs to class 1 (as opposed to class 0). To do this, we take a linear combination of the inputs, like in linear regression, but then transform that value by using the logistic sigmoid function so the output stays between 0 and 1.
The transformation is necessary because a linear function alone can produce any real number—from negative infinity to positive infinity—but a probability must fall within the [0, 1] range. The sigmoid "squashes" the linear result into that interval, allowing us to interpret it as a probability: a value near 0 means low likelihood of class 1, near 1 means high confidence.
The logistic regression explainer mentioned earlier elaborates on how this transformation is applied and derived.
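As a quick illustration (a minimal sketch assuming NumPy; the scores are made up), the sigmoid squashes any real-valued linear score into the (0, 1) interval:

```python
# The logistic (sigmoid) squashing step: a linear score z can be any real
# number, but sigmoid(z) always lands in (0, 1), so it can be read as a
# probability of class 1.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])  # unbounded linear scores
print(sigmoid(z))  # values near 0, below 0.5, exactly 0.5, above 0.5, near 1
```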
Step 2: Likelihood as a function of parameters
Now we ask: given our data (input-output pairs), what choice of parameters best explains what we’ve seen? MLE formalizes this by defining a likelihood function—the probability of observing all our training labels, given the inputs and current model parameters.
If the actual label $y_i$ is 1, we want the predicted probability $\hat{p}_i$ to be high. If $y_i$ is 0, we want $1 - \hat{p}_i$ to be high. The likelihood of the whole training set is the product of these per-example probabilities:

$$L(\beta) = \prod_{i=1}^{n} \hat{p}_i^{\,y_i} \left( 1 - \hat{p}_i \right)^{1 - y_i}$$

This product accumulates the evidence across all examples, and our job is to find the parameters that make this entire product (the joint likelihood) as large as possible. The parameters that maximize the likelihood function are deemed the best model of the observations.
Step 3: Taking the log for stability
Multiplying many small probabilities can lead to numerical instability (the values become vanishingly small). So instead, we take the log of the likelihood function, turning the product into a sum and making the math easier to handle:

$$\log L(\beta) = \sum_{i=1}^{n} \left[ y_i \log \hat{p}_i + (1 - y_i) \log \left( 1 - \hat{p}_i \right) \right]$$

The logarithm simply transforms a product of extreme values into a more stable sum that is easier to compute and optimize.
This log-likelihood is now our new objective function—the thing that we want to maximize. Each term rewards the model when it assigns a high probability to the correct class.
Think of it like this: every time the model is "confident and correct," we add a positive score. But if it’s confident and wrong—say it assigns a high probability to class 1 when the true label is 0—the log-likelihood penalizes that sharply.
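Here is a small sketch of that scoring behavior, assuming NumPy and invented labels and probabilities:

```python
# Bernoulli log-likelihood for a handful of predictions. The labels and
# probabilities below are made up for illustration.
import numpy as np

y = np.array([1, 0, 1, 0])               # true labels
p_hat = np.array([0.9, 0.2, 0.6, 0.95])  # model's predicted P(class = 1)

# Each term rewards high probability on the correct class; the last example
# (confident and wrong) contributes a large negative value.
log_lik_terms = y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat)
print(log_lik_terms)        # per-example contributions
print(log_lik_terms.sum())  # total log-likelihood to be maximized
```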
Step 4: Optimizing the log-likelihood
We don’t have a nice closed-form solution here (unlike linear regression), so we turn to gradient descent—an iterative optimization method. The idea is to take small steps in the direction that increases the log-likelihood the most, eventually reaching a parameter setting where the model best explains the data.
$$\beta_{\text{new}} = \beta_{\text{old}} + \eta \, \nabla_{\beta} \log L(\beta)$$

This equation shows how we move from the current parameter vector to a new one by adding a small step in the direction of the gradient.
The term $\log L(\beta)$ is the log-likelihood function: it quantifies how well the model explains the observed data. The gradient $\nabla_{\beta} \log L$ tells us the direction in which this function increases most steeply. The parameter $\eta$, called the learning rate, controls how large a step we take: too large and we might overshoot the optimum; too small and training can be painfully slow.
This solution uses the same principles of gradient descent explained in the linear regression segment. In short, this equation is the engine of learning: it gradually adjusts the model to increase the likelihood of getting the right answer.
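A minimal sketch of this update loop, assuming NumPy, synthetic data and an arbitrary learning rate, might look like this:

```python
# Gradient ascent on the log-likelihood for logistic regression, using
# synthetic data. The learning rate and iteration count are arbitrary
# choices for illustration.
import numpy as np

rng = np.random.default_rng(0)
n, p = 300, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + features
true_beta = np.array([0.5, 2.0, -1.5])
y = (rng.random(n) < 1 / (1 + np.exp(-X @ true_beta))).astype(float)

beta = np.zeros(p + 1)
eta = 0.1                             # learning rate
for _ in range(1000):
    p_hat = 1 / (1 + np.exp(-X @ beta))
    gradient = X.T @ (y - p_hat)      # gradient of the log-likelihood
    beta = beta + eta * gradient / n  # step uphill (ascent)

print("Estimated coefficients:", beta)
```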
Step 5: From probabilities to decisions
Once optimized, we classify new inputs by thresholding the predicted probability. For binary classification: if the predicted probability p is greater than 0.5, then the predicted class label y is assigned as 1; otherwise, y is assigned as 0.
For multiclass classification (that is, softmax regression), we generalize this using the softmax function over class scores and choose the class with the highest probability. The threshold can be adjusted according to the discretion of the practitioners and the problems at hand.
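For illustration, here is a small sketch (assuming NumPy; the probability and class scores are invented) of both decision rules:

```python
# Turning scores into decisions. For binary problems we threshold a single
# probability; for multiclass problems we apply softmax to a score vector
# and take the argmax. The scores below are illustrative only.
import numpy as np

# Binary: threshold at 0.5 (adjustable to match the problem's costs)
p = 0.73
label = 1 if p > 0.5 else 0
print("Binary decision:", label)

# Multiclass: softmax over class scores, then pick the most probable class
scores = np.array([1.2, 0.3, 2.5])
probs = np.exp(scores - scores.max())  # subtract max for numerical stability
probs /= probs.sum()
print("Class probabilities:", probs)
print("Predicted class:", int(np.argmax(probs)))
```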
Maximum likelihood estimation (MLE) is the conceptual backbone of modern classification. It gives us a principled way to train models that don’t just guess, but assign measurable confidence to each prediction. That’s incredibly important in domains where mistakes are costly: healthcare, criminal justice, fraud detection and more.
Even the most advanced models in gen AI, like transformers, still follow this playbook. Their final layers typically output a softmax distribution, and training is done through log-likelihood maximization. So when you’re learning logistic regression and MLE, you’re not just learning a "simple" model—you’re studying the same ideas underpinning some of the most powerful systems in AI.
Regression and classification tasks share a family of powerful supervised learning algorithms. While many algorithms can be adapted for both settings, the choice of model—and how it is configured—depends heavily on the type of output: continuous (regression) versus categorical (classification). This section introduces some of the most widely used algorithms, along with their intuitive mechanisms and mathematical underpinnings.
Decision trees are models that make predictions by asking a series of simple questions. Imagine playing a game of 20 questions—the tree starts at the top and works its way down by asking things like “Is the income greater than 50,000?” or “Is the age under 30?” Each question splits the data into smaller, more similar groups.
For classification, the goal is to group similar labels together, and the tree tries to reduce confusion (called "impurity") with each split. For regression, it tries to make each group’s average value closer to the actual numbers, reducing prediction error.
The tree keeps splitting until it reaches a stopping point—like hitting a maximum number of questions or having too few examples left to split. In the end, predictions are made based on the majority label (for classification) or the average value (for regression) in the final group, or “leaf.”
Classification trees aim to maximize information gain by reducing metrics like Gini impurity or entropy:

$$G_m = \sum_{k=1}^{K} \hat{p}_{mk} \left( 1 - \hat{p}_{mk} \right), \qquad D_m = - \sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}$$

where $\hat{p}_{mk}$ is the proportion of class $k$ in node $m$. Gini impurity ($G_m$) and entropy ($D_m$) are measures of how mixed or "impure" a node is in a decision tree. They help the tree decide where to split by quantifying how well a feature separates the classes: lower values mean purer nodes and better splits.
Regression trees aim to minimize the residual sum of squares (RSS) within each leaf:

$$\sum_{m=1}^{M} \sum_{i \in R_m} \left( y_i - \hat{y}_{R_m} \right)^2$$

where $R_m$ is the m-th region (leaf) and $\hat{y}_{R_m}$ is the mean of the observations in that region.
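As a sketch (assuming a recent scikit-learn release and synthetic data; the depth and criterion settings are illustrative), both tree variants can be fit in a few lines:

```python
# A classification tree splitting on Gini impurity and a regression tree
# minimizing squared error, both on synthetic data.
from sklearn.datasets import make_classification, make_regression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X_clf, y_clf = make_classification(n_samples=300, n_features=4, random_state=0)
clf_tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf_tree.fit(X_clf, y_clf)
print("Classification tree accuracy:", clf_tree.score(X_clf, y_clf))

X_reg, y_reg = make_regression(n_samples=300, n_features=4, noise=5.0, random_state=0)
reg_tree = DecisionTreeRegressor(criterion="squared_error", max_depth=3, random_state=0)
reg_tree.fit(X_reg, y_reg)
print("Regression tree R^2:", reg_tree.score(X_reg, y_reg))
```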
A random forest builds an ensemble of decision trees, each trained on a bootstrapped subset of the data and with random feature selection at each split. Final predictions are made by averaging (regression) or majority voting (classification) across the trees. Random forests reduce variance by averaging over many diverse models. Each tree is a "noisy expert," and aggregating them leads to a robust predictor. Random feature selection ensures diversity among the trees.
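A minimal random forest sketch, assuming scikit-learn and synthetic data, with illustrative hyperparameters:

```python
# A random forest classifier: many trees trained on bootstrapped samples
# with random feature subsets, combined by majority vote.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)
print("Random forest accuracy:", forest.score(X_test, y_test))
```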
K-nearest neighbors (KNN) is a nonparametric, instance-based method that stores the entire training set. This means that, unlike linear regression, KNN does not make any assumptions about the underlying distribution of the data and has no parameters to estimate.
To make a prediction for a new input, it finds the k closest training examples and either averages their values (regression) or selects the majority class (classification). KNN predicts based on local similarity: “you are like the people closest to you.” It makes no assumptions about the underlying function, which makes it flexible but computationally expensive.
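A quick sketch of both KNN variants, assuming scikit-learn and synthetic data (k = 5 is an arbitrary illustrative choice):

```python
# K-nearest neighbors for both tasks: predictions come from the k closest
# training points (majority vote for classification, average for regression).
from sklearn.datasets import make_classification, make_regression
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X_clf, y_clf = make_classification(n_samples=300, n_features=5, random_state=0)
knn_clf = KNeighborsClassifier(n_neighbors=5).fit(X_clf, y_clf)
print("KNN classification accuracy:", knn_clf.score(X_clf, y_clf))

X_reg, y_reg = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)
knn_reg = KNeighborsRegressor(n_neighbors=5).fit(X_reg, y_reg)
print("KNN regression R^2:", knn_reg.score(X_reg, y_reg))
```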
Support vector machines (SVMs) find the hyperplane: a decision boundary that best separates classes (classification) or fits a margin around the data (regression), with maximal margin or minimal deviation. SVMs focus on the “hardest” data points to classify, those near the decision boundary. They ignore easy examples and try to find the most robust separator. Kernels extend SVMs to nonlinear problems.
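For illustration, a minimal sketch (assuming scikit-learn and synthetic data; the RBF kernel and C value are illustrative choices):

```python
# Support vector machines: SVC for classification and SVR for regression,
# both with an RBF kernel to handle nonlinear boundaries.
from sklearn.datasets import make_classification, make_regression
from sklearn.svm import SVC, SVR

X_clf, y_clf = make_classification(n_samples=300, n_features=5, random_state=0)
svm_clf = SVC(kernel="rbf", C=1.0).fit(X_clf, y_clf)
print("SVM classification accuracy:", svm_clf.score(X_clf, y_clf))

X_reg, y_reg = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)
svm_reg = SVR(kernel="rbf", C=1.0).fit(X_reg, y_reg)
print("SVM regression R^2:", svm_reg.score(X_reg, y_reg))
```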
Naive Bayes, in plain terms, is a “count‑and‑compare” approach. Naive Bayes is most often used for classification and not regression because it is designed to assign class labels based on probabilities.
The method is called naive because it assumes each word (or feature) is independent of the others. This oversimplification works shockingly well, especially for text classification where counting word occurrences is natural and fast.
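Here is a small "count-and-compare" sketch, assuming scikit-learn; the tiny spam corpus and labels are entirely made up for illustration:

```python
# Bag-of-words counts fed to a multinomial naive Bayes spam classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = [
    "win a free prize now", "free money click here",        # spam
    "meeting at noon tomorrow", "project update attached",  # not spam
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)  # word-count features
model = MultinomialNB().fit(X, labels)

new_email = vectorizer.transform(["free prize meeting"])
print("P(not spam), P(spam):", model.predict_proba(new_email))
```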
Evaluating machine learning models requires different metrics depending on whether your task is regression or classification. Choosing the right metric can help you understand model performance, compare alternatives and align with real-world business or scientific goals.
Regression metrics measure how close the predicted values are to the actual numerical outcomes.
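Common regression metrics include mean absolute error (MAE), mean squared error (MSE) and the coefficient of determination (R²). A minimal sketch with scikit-learn, using made-up house prices and predictions:

```python
# Common regression metrics computed with scikit-learn on invented data.
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [250_000, 310_000, 480_000, 199_000]  # actual house prices
y_pred = [265_000, 300_000, 455_000, 210_000]  # model predictions

print("MAE:", mean_absolute_error(y_true, y_pred))
print("MSE:", mean_squared_error(y_true, y_pred))
print("R^2:", r2_score(y_true, y_pred))
```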
Classification metrics evaluate how well the model separates classes.
- Accuracy: the proportion of correct predictions.
- Precision: of all predicted positives, how many are actually positive?
- Recall: of all actual positives, how many were correctly predicted?
- F1 score: the harmonic mean of precision and recall.
- Confusion matrix: a tabular breakdown of true and false positives and negatives. Ideal for multiclass or imbalanced datasets.
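These metrics can be computed directly with scikit-learn, as in this sketch on made-up binary predictions:

```python
# Accuracy, precision, recall, F1 and the confusion matrix on invented labels.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1 score: ", f1_score(y_true, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
```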
Classification and regression are two foundational pillars of supervised learning. In this article, we’ve explored how classification involves predicting discrete labels, while regression focuses on estimating continuous outcomes. We introduced some of the most widely used algorithms—from linear regression and logistic regression to decision trees, support vector machines and Bayesian methods. We also examined how they approximate functions and make predictions through statistical principles like least squares and maximum likelihood estimation.
While these models might appear elementary compared to today’s towering deep learning architectures, their principles quietly underpin many breakthroughs in modern AI.
The sigmoid function used in logistic regression is now a common activation function in neural networks. Gradient descent, first popularized in optimizing linear models, remains the backbone of how large neural networks—including transformers—are trained today. Linear combinations of input features, so central to simple models, form the first layers of most deep models. Understanding these “classical” approaches gives us a view into how complex systems like ChatGPT or BERT actually learn.
If you’re interested in diving deeper, there’s a vibrant body of research that continues to apply classification and regression in critical fields. In medicine, logistic regression is still heavily used to predict disease risk and treatment outcomes. In environmental science, regression models help quantify climate change effects. In bioinformatics, support vector machines classify genetic mutations. Far from being relics, these models remain indispensable tools of scientific progress.
Here are a few influential papers worth exploring:
Recent research continues to demonstrate the enduring relevance of these methods:
Yanya Lin, Jianxiong Hu, Rongbin Xu, Shaocong Wu, Fei Ma, Hui Liu, Ying Xie and Xin Li. 2023. “Application of Logistic Regression and Artificial Intelligence in the Risk Prediction of Acute Aortic Dissection Rupture.” Journal of Clinical Medicine 12 (1): 179. https://www.mdpi.com/2077-0383/12/1/179.
He-Yan Li, Li Dong, Wen-Da Zhou, Hao-Tian Wu, Rui-Heng Zhang, Yi-Tong Li, Chu-Yao Yua and Wen-Bin Wei. 2023. “Development and validation of medical record-based logistic regression and machine learning models to diagnose diabetic retinopathy.” Graefe's archive for ophthalmology. 261(3):681-689. https://pubmed.ncbi.nlm.nih.gov/36239780/.
Aaron W. Sievering, Peter Wohlmuth, Nele Geßler, Melanie A. Gunawardene, Klaus Herrlinger, Berthold Bein, Dirk Arnold, Martin Bergmann, Lorenz Nowak, Christian Gloeckner, Ina Koch, Martin Bachmann, Christoph U. Herborn and Axel Stang. 2022. “Comparison of machine learning methods with logistic regression analysis in creating predictive models for risk of critical in-hospital events in COVID-19 patients on hospital admission.” BMC Medical Informatics and Decision Making 22: Article 309. https://link.springer.com/article/10.1186/s12911-022-02057-4.
Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. 2013. An Introduction to Statistical Learning: With Applications in R. New York: Springer. https://www.statlearning.com.
Anmol Chaure, Ashok Kumar Behera and Sudip Bhattacharya. 2023. “Finding the Perfect Fit: Applying Regression Models to ClimateBench v1.0.” arXiv preprint arXiv:2308.11854. https://arxiv.org/abs/2308.11854.
Mohammad Meysami, Vijay Kumar, McKayah Pugh, Samuel Thomas Lowery, Shantanu Sur, Sumona Mondal and James M. Greene. 2023. “Utilizing logistic regression to compare risk factors in disease modeling with imbalanced data: a case study in vitamin D and cancer incidence.” Frontiers in Oncology 13: 1227842. https://www.frontiersin.org/articles/10.3389/fonc.2023.1227842/full.
Trevor Hastie, Robert Tibshirani and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference and Prediction. 2nd ed. New York: Springer. https://link.springer.com/book/10.1007/978-0-387-84858-7.