Date: 2025-06-16

In regression, our goal is to estimate a number, for instance the price of a house. But classification asks a different kind of question: what is the likelihood that this input belongs to category A versus category B? For example, will this email be spam or not? Will this patient develop a disease or not?

To answer these types of questions, we use probability-based models. And one of the most powerful frameworks for learning these probabilities is maximum likelihood estimation (MLE). MLE allows us to find the model parameters that make our observed data most likely under the model—a principle that is both statistically elegant and widely used in practice.

Let’s walk through how classification is framed as a function approximation problem with MLE, by using logistic regression as a simple but powerful example. You can visit the logistic regression page for details on how we derive the logistic function.

Step 1: From outputs to probabilities

First, instead of predicting a number directly, we estimate the probability that the outcome belongs to class 1 (as opposed to class 0). To do this, we take a linear combination of the inputs, as in linear regression, and then pass that value through the logistic sigmoid function so the output stays between 0 and 1.

$$P(y = 1 \mid x) = \frac{1}{1 + e^{-\left(\beta_0 + \sum_{j=1}^{p} \beta_j x_j\right)}} = \sigma\left(\beta_0 + x^\top \beta\right)$$

The transformation is necessary because a linear function alone can produce any real number—from negative infinity to positive infinity—but a probability must fall within the [0, 1] range. The sigmoid "squashes" the linear result into that interval, allowing us to interpret it as a probability: a value near 0 means low likelihood of class 1, near 1 means high confidence.
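As a minimal sketch of this step, the snippet below computes the linear score and then squashes it with the sigmoid. The parameter values and inputs are made up for illustration, and the helper names `sigmoid` and `predict_proba` are our own, not part of any library.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: maps any real number into the interval (0, 1)."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, beta0, beta):
    """P(y = 1 | x) for each row of X under a logistic regression model."""
    return sigmoid(beta0 + X @ beta)

# Hypothetical parameters and inputs, chosen only for illustration.
beta0 = -1.0
beta = np.array([2.0, -0.5])
X = np.array([[0.0, 0.0],
              [1.0, 1.0],
              [3.0, 0.0]])

linear_scores = beta0 + X @ beta               # unbounded real values: -1.0, 0.5, 5.0
probabilities = predict_proba(X, beta0, beta)  # the same scores squashed into (0, 1)
print(linear_scores)
print(probabilities)
```

A score of 0 maps to a probability of exactly 0.5, negative scores fall below it and positive scores above it. For very large negative scores this naive formula can trigger overflow warnings in `np.exp`; a numerically hardened version such as `scipy.special.expit` is a reasonable substitute in practice.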

The logistic regression explainer mentioned earlier covers in detail how this transformation is derived and applied.

Step 2: Likelihood as a function of parameters

Now we ask: given our data (input-output pairs), what choice of parameters best explains what we’ve seen? MLE formalizes this by defining a likelihood function—the probability of observing all our training labels, given the inputs and current model parameters.

$$L(\beta) = \prod_{i=1}^{n} P(y_i \mid x_i; \beta) = \prod_{i=1}^{n} \hat{p}_i^{\,y_i} \left(1 - \hat{p}_i\right)^{1 - y_i}$$

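Here $\hat{p}_i = P(y_i = 1 \mid x_i; \beta)$ is the predicted probability of class 1 for example $i$. If the actual label $y_i$ is 1, then we want the predicted probability $\hat{p}_i$ to be high. If $y_i$ is 0, we want $(1 - \hat{p}_i)$ to be high.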

This product accumulates the evidence across all examples, and our job is to find the parameters that make the entire product (the joint likelihood) as large as possible. The parameters that maximize the likelihood function are the ones taken to best explain the observed data.
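To make the likelihood concrete, here is a small sketch that evaluates it for two candidate parameter settings on a toy dataset. The data and both parameter choices are invented for illustration, and `likelihood` is a hypothetical helper name.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def likelihood(beta0, beta, X, y):
    """Joint probability of the observed labels under the model."""
    p_hat = sigmoid(beta0 + X @ beta)
    # Each factor is p_hat when y = 1 and (1 - p_hat) when y = 0.
    per_example = p_hat**y * (1.0 - p_hat)**(1 - y)
    return np.prod(per_example)

# Toy data (made up): larger x goes with label 1.
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])

good = likelihood(0.0, np.array([2.0]), X, y)   # slope agrees with the data
bad = likelihood(0.0, np.array([-2.0]), X, y)   # slope points the wrong way
print(good, bad)
```

The first setting yields a likelihood of roughly 0.75, while the second yields only a few millionths, so maximum likelihood prefers the first.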

Step 3: Taking the log for stability

Multiplying many small probabilities can lead to numerical instability: the product shrinks toward zero and can underflow in floating-point arithmetic. So instead, we take the log of the likelihood function, which turns the product into a sum and makes the math easier to handle. Because the logarithm is strictly increasing, the parameters that maximize the log-likelihood are the same ones that maximize the likelihood.

$$\log L(\beta) = \sum_{i=1}^{n} \left[ y_i \log \hat{p}_i + (1 - y_i) \log\left(1 - \hat{p}_i\right) \right]$$

This log-likelihood is now our new objective function—the thing that we want to maximize. Each term rewards the model when it assigns a high probability to the correct class.
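The sketch below illustrates why the log helps, assuming standard 64-bit floats: a product of 2,000 factors of 0.5 underflows to zero, while the sum of their logs stays an ordinary number. It also includes a direct implementation of the log-likelihood; the example probabilities and labels are made up.

```python
import numpy as np

# 2,000 hypothetical per-example probabilities, each equal to 0.5.
probs = np.full(2000, 0.5)

product = np.prod(probs)         # 0.5**2000 underflows float64 to 0.0
log_sum = np.sum(np.log(probs))  # 2000 * log(0.5), about -1386.3

def log_likelihood(p_hat, y):
    """Sum of y*log(p_hat) + (1 - y)*log(1 - p_hat) over all examples."""
    return np.sum(y * np.log(p_hat) + (1 - y) * np.log(1.0 - p_hat))

# Made-up predictions and labels for three examples.
y = np.array([1, 0, 1])
p_hat = np.array([0.9, 0.2, 0.6])
ll = log_likelihood(p_hat, y)    # equals log(0.9 * 0.8 * 0.6)
print(product, log_sum, ll)
```

Once the product has collapsed to 0.0, every parameter setting looks equally bad and there is nothing left to optimize; the log-sum keeps the differences visible.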

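Think of it like this: each term is the log of a probability, so it is never positive. When the model is confident and correct, the term is close to 0 and costs almost nothing. But if it’s confident and wrong, say it assigns a high probability to class 1 when the true label is 0, the term becomes a large negative number and the log-likelihood is penalized sharply.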
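A few numbers may help here. Each example's term is the log of a probability, so it is at most 0. The sketch below evaluates a single term for three hypothetical predictions; `contribution` is our own helper name.

```python
import math

def contribution(y, p_hat):
    """One example's term in the log-likelihood."""
    return y * math.log(p_hat) + (1 - y) * math.log(1.0 - p_hat)

confident_correct = contribution(1, 0.99)  # about -0.01
unsure = contribution(1, 0.5)              # about -0.69
confident_wrong = contribution(0, 0.99)    # about -4.61
print(confident_correct, unsure, confident_wrong)
```

A confident, correct prediction costs almost nothing, while a confident, wrong one costs several hundred times more, which is what pushes the model away from overconfident mistakes.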

Step 4: Optimizing the log-likelihood

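We don’t have a nice closed-form solution here (unlike linear regression), so we turn to an iterative, gradient-based optimization method. Because we are maximizing, the update below is gradient ascent on the log-likelihood, which is equivalent to gradient descent on the negative log-likelihood. The idea is to take small steps in the direction that increases the log-likelihood the most, eventually reaching a parameter setting where the model best explains the data.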

$$\beta^{(t+1)} = \beta^{(t)} + \eta \, \nabla_{\beta} \log L\left(\beta^{(t)}\right)$$

The equation shows how we move from the current parameter vector $\beta^{(t)}$ to a new one, $\beta^{(t+1)}$, by adding a small step in the direction of the gradient.

The term $\log L(\beta)$ is the log-likelihood function; it quantifies how well the model explains the observed data. The gradient $\nabla_{\beta} \log L$ tells us the direction in which this function increases most steeply. The parameter $\eta$, called the learning rate, controls how large a step we take: too large and we might overshoot the optimum; too small and training can be painfully slow.

This solution uses the same principles of gradient descent explained in the linear regression segment. In short, this equation is the engine of learning: it gradually adjusts the model to increase the likelihood of getting the right answer.
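The update rule can be sketched in a few lines of NumPy. For logistic regression the gradient of the log-likelihood works out to $\sum_i (y_i - \hat{p}_i)\, x_i$, which is what the code computes in matrix form. Several choices here are assumptions made for illustration: the synthetic data and its "true" parameters, the learning rate and step count, the helper names, and the division by $n$ (which averages the gradient and only rescales the learning rate).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, learning_rate=0.1, n_steps=2000):
    """Gradient ascent on the average log-likelihood of logistic regression."""
    n, p = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])        # column of ones carries the intercept
    beta = np.zeros(p + 1)
    for _ in range(n_steps):
        p_hat = sigmoid(Xb @ beta)
        gradient = Xb.T @ (y - p_hat) / n       # gradient of the mean log-likelihood
        beta = beta + learning_rate * gradient  # step uphill
    return beta

def mean_log_likelihood(beta, X, y):
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    p_hat = sigmoid(Xb @ beta)
    return np.mean(y * np.log(p_hat) + (1 - y) * np.log(1.0 - p_hat))

# Synthetic data drawn from a known model (hypothetical "true" parameters).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
true_beta = np.array([-0.5, 2.0, -1.0])         # intercept, then two slopes
y = (rng.random(500) < sigmoid(true_beta[0] + X @ true_beta[1:])).astype(float)

beta_hat = fit_logistic(X, y)
print(beta_hat)
print(mean_log_likelihood(np.zeros(3), X, y), mean_log_likelihood(beta_hat, X, y))
```

Starting from all zeros, the mean log-likelihood begins at log(0.5) and rises as the steps accumulate. The fitted coefficients should land near the generating values but will not match them exactly, because the labels are noisy and the sample is finite.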

Step 5: From probabilities to decisions

Once optimized, we classify new inputs by thresholding the predicted probability. For binary classification: if the predicted probability $\hat{p}$ is greater than 0.5, the predicted class label $\hat{y}$ is 1; otherwise, it is 0. The threshold can be adjusted at the practitioner’s discretion to suit the problem at hand, for example lowered when missing a positive case is costlier than raising a false alarm.

For multiclass classification (that is, softmax regression), we generalize this by applying the softmax function to the class scores and choosing the class with the highest probability.
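Both decision rules can be sketched briefly. The probabilities, class scores and the alternative threshold of 0.7 below are made-up values, and the function names are our own.

```python
import numpy as np

def classify_binary(p_hat, threshold=0.5):
    """Label 1 when the predicted probability exceeds the threshold, else 0."""
    return (np.asarray(p_hat) > threshold).astype(int)

def softmax(scores):
    """Turn each row of class scores into a probability distribution."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # avoids overflow in exp
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

p_hat = np.array([0.10, 0.55, 0.80])
default_labels = classify_binary(p_hat)                # [0, 1, 1]
strict_labels = classify_binary(p_hat, threshold=0.7)  # [0, 0, 1]

# Hypothetical scores for two inputs over three classes.
scores = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])
class_probs = softmax(scores)                  # each row sums to 1
predicted_class = class_probs.argmax(axis=1)   # [0, 2]
print(default_labels, strict_labels, predicted_class)
```

Raising the threshold from 0.5 to 0.7 flips the middle example from class 1 to class 0, which shows how the same fitted model can yield different decisions depending on the chosen cutoff.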

Maximum likelihood estimation (MLE) is the conceptual backbone of modern classification. It gives us a principled way to train models that don’t just guess, but assign measurable confidence to each prediction. That’s incredibly important in domains where mistakes are costly: healthcare, criminal justice, fraud detection and more.

Even the most advanced models in gen AI, like transformers, still follow this playbook. Their final layers typically output a softmax distribution, and training is done through log-likelihood maximization. So when you’re learning logistic regression and MLE, you’re not just learning a "simple" model—you’re studying the same ideas underpinning some of the most powerful systems in AI.